linux

Author	SHA1	Message	Date
Peter Zijlstra	e51fd5e22e	sched: Fix wake_affine() vs RT tasks Mike reports that since `e9e9250b` (sched: Scale down cpu_power due to RT tasks), wake_affine() goes funny on RT tasks due to them still having a !0 weight and wake_affine() still subtracts that from the rq weight. Since nobody should be using se->weight for RT tasks, set the value to zero. Also, since we now use ->cpu_power to normalize rq weights to account for RT cpu usage, add that factor into the imbalance computation. Reported-by: Mike Galbraith <efault@gmx.de> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1275316109.27810.22969.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-01 09:27:16 +02:00
Rusty Russell	293a7cfeed	module: fix reference to mod->percpu after freeing module. Rafael sees a sometimes crash at precpu_modfree from kernel/module.c; it only occurred with another (since-reverted) patch, but that patch simply changed timing to uncover this bug, it was otherwise unrelated. The comment about the mod being freed is self-explanatory, but neither Tejun nor I read it. This bug was introduced in `259354deaa`, after it had previously been fixed in `6e2b75740b`. How embarrassing. Reported-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Embarrassingly-Acked-by: Tejun Heo <tj@kernel.org> Cc: Masami Hiramatsu <mhiramat@redhat.com> Tested-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-31 11:21:32 -07:00
Randy Dunlap	546cf44a1b	blktrace: Fix new kernel-doc warnings Fix blktrace.c kernel-doc warnings: Warning(kernel/trace/blktrace.c:858): No description found for parameter 'ignore' Warning(kernel/trace/blktrace.c:890): No description found for parameter 'ignore' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100529114507.c466fc1e.randy.dunlap@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 09:58:20 +02:00
Frederic Weisbecker	74048f895f	perf_events: Fix unincremented buffer base on partial copy If a sample size crosses to the next page boundary, the copy will be made in more than one step. However we forget to advance the source offset for the next copy, leading to unexpected double copies that completely mess up the traces. This fixes various kinds of bad traces that have irrelevant data inside, as an example: geany-4979 [001] 5758.077775: sched_switch: prev_comm=! prev_pid=121 prev_prio=0 prev_state=S\|D\|Z\|X\|x ==> next_comm= next_pid=7497072 next_prio=0 Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1274988898-5639-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:46:10 +02:00
Stephane Eranian	90151c35b1	perf_events: Fix event scheduling issues introduced by transactional API The transactional API patch between the generic and model-specific code introduced several important bugs with event scheduling, at least on X86. If you had pinned events, e.g., watchdog, and were over-committing the PMU, you would get bogus counts. The bug was showing up on Intel CPU because events would move around more often that on AMD. But the problem also existed on AMD, though harder to expose. The issues were: - group_sched_in() was missing a cancel_txn() in the error path - cpuc->n_added was not properly maintained, leading to missing actions in hw_perf_enable(), i.e., n_running being 0. You cannot update n_added until you know the transaction has succeeded. In case of failed transaction n_added was not adjusted back. - in case of failed transactions, event_sched_out() was called and eventually invoked x86_disable_event() to touch the HW reg. But with transactions, on X86, event_sched_in() does not touch HW registers, it simply collects events into a list. Thus, you could end up calling x86_disable_event() on a counter which did not correspond to the current event when idx != -1. The patch modifies the generic and X86 code to avoid all those problems. First, we keep track of the number of events added last. In case the transaction fails, we substract them from n_added. This approach is necessary (as opposed to delaying updates to n_added) because not all event updates use the transaction API, e.g., single events. Second, we encapsulate the event_sched_in() and event_sched_out() in group_sched_in() inside the transaction. That makes the operations symmetrical and you can also detect that you are inside a transaction and skip the HW reg access by checking cpuc->group_flag. With this patch, you can now overcommit the PMU even with pinned system-wide events present and still get valid counts. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1274796225.5882.1389.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:46:10 +02:00
Peter Zijlstra	2e97942fe5	perf_events, trace: Fix perf_trace_destroy(), mutex went missing Steve spotted I forgot to do the destroy under event_mutex. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1274451913.1674.1707.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:46:09 +02:00
Peter Zijlstra	3771f07711	perf_events, trace: Fix probe unregister race tracepoint_probe_unregister() does not synchronize against the probe callbacks, so do that explicitly. This properly serializes the callbacks and the free of the data used therein. Also, use this_cpu_ptr() where possible. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1274438476.1674.1702.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:46:09 +02:00
Peter Zijlstra	8a49542c05	perf_events: Fix races in group composition Group siblings don't pin each-other or the parent, so when we destroy events we must make sure to clean up all cross referencing pointers. In particular, for destruction of a group leader we must be able to find all its siblings and remove their reference to it. This means that detaching an event from its context must not detach it from the group, otherwise we can end up failing to clear all pointers. Solve this by clearly separating the attachment to a context and attachment to a group, and keep the group composed until we destroy the events. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:46:09 +02:00
Peter Zijlstra	ac9721f3f5	perf_events: Fix races and clean up perf_event and perf_mmap_data interaction In order to move toward separate buffer objects, rework the whole perf_mmap_data construct to be a more self-sufficient entity, one with its own lifetime rules. This greatly sanitizes the whole output redirection code, which was riddled with bugs and races. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:46:08 +02:00
Amit K. Arora	54e88fad22	sched: Make sure timers have migrated before killing the migration_thread Problem: In a stress test where some heavy tests were running along with regular CPU offlining and onlining, a hang was observed. The system seems to be hung at a point where migration_call() tries to kill the migration_thread of the dying CPU, which just got moved to the current CPU. This migration thread does not get a chance to run (and die) since rt_throttled is set to 1 on current, and it doesn't get cleared as the hrtimer which is supposed to reset the rt bandwidth (sched_rt_period_timer) is tied to the CPU which we just marked dead! Solution: This patch pushes the killing of migration thread to "CPU_POST_DEAD" event. By then all the timers (including sched_rt_period_timer) should have got migrated (along with other callbacks). Signed-off-by: Amit Arora <aarora@in.ibm.com> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20100525132346.GA14986@amitarora.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:37:44 +02:00
Linus Torvalds	fa7eadab4b	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: mutex: Fix optimistic spinning vs. BKL	2010-05-30 12:35:15 -07:00
Linus Torvalds	bc7d352c5e	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tui: Fix last use_browser problem related to .perfconfig perf symbols: Add the build id cache to the vmlinux path perf tui: Reset use_browser if stdout is not a tty ring-buffer: Move zeroing out excess in page to ring buffer code ring-buffer: Reset "real_end" when page is filled	2010-05-30 12:35:01 -07:00
Rafael J. Wysocki	e9a5f426b8	CPU: Avoid using unititialized error variable in disable_nonboot_cpus() If there's only one CPU online when disable_nonboot_cpus() is called, the error variable will not be initialized and that may lead to erroneous behavior. Fix this issue by initializing error in disable_nonboot_cpus() as appropriate. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-30 09:06:00 -07:00
Linus Torvalds	35926ff5fb	Revert "cpusets: randomize node rotor used in cpuset_mem_spread_node()" This reverts commit `0ac0c0d0f8`, which caused cross-architecture build problems for all the wrong reasons. IA64 already added its own version of __node_random(), but the fact is, there is nothing architectural about the function, and the original commit was just badly done. Revert it, since no fix is forthcoming. Requested-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-30 09:00:03 -07:00
Linus Torvalds	b612a05537	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: clean up on forwarded aborted mds request ceph: fix leak of osd authorizer ceph: close out mds, osd connections before stopping auth ceph: make lease code DN specific fs/ceph: Use ERR_CAST ceph: renew auth tickets before they expire ceph: do not resend mon requests on auth ticket renewal ceph: removed duplicated #includes ceph: avoid possible null dereference ceph: make mds requests killable, not interruptible sched: add wait_for_completion_killable_timeout	2010-05-30 08:56:39 -07:00
Sage Weil	0aa12fb439	sched: add wait_for_completion_killable_timeout Add missing _killable_timeout variant for wait_for_completion that will return when a timeout expires or the task is killed. CC: Ingo Molnar <mingo@elte.hu> CC: Andreas Herrmann <andreas.herrmann3@amd.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:30 -07:00
Linus Torvalds	29d03fa12b	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix_timer: Fix error path in timer_create hrtimer: Avoid double seqlock timers: Move local variable into else section timers: Fix slack calculation really	2010-05-28 10:16:27 -07:00
Al Viro	ea635c64e0	Fix racy use of anon_inode_getfd() in perf_event.c once anon_inode_getfd() is called, you can't expect anything about struct file that descriptor points to - another thread might be doing whatever it likes with descriptor table at that point. Cc: stable <stable@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:03:08 -04:00
Linus Torvalds	c5617b200a	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (61 commits) tracing: Add __used annotation to event variable perf, trace: Fix !x86 build bug perf report: Support multiple events on the TUI perf annotate: Fix up usage of the build id cache x86/mmiotrace: Remove redundant instruction prefix checks perf annotate: Add TUI interface perf tui: Remove annotate from popup menu after failure perf report: Don't start the TUI if -D is used perf: Fix getline undeclared perf: Optimize perf_tp_event_match() perf: Remove more code from the fastpath perf: Optimize the !vmalloc backed buffer perf: Optimize perf_output_copy() perf: Fix wakeup storm for RO mmap()s perf-record: Share per-cpu buffers perf-record: Remove -M perf: Ensure that IOC_OUTPUT isn't used to create multi-writer buffers perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events perf, trace: Optimize tracepoints by removing IRQ-disable from perf/tracepoint interaction perf tui: Allow disabling the TUI on a per command basis in ~/.perfconfig ...	2010-05-27 15:23:47 -07:00
Andrey Vagin	45e0fffc8a	posix_timer: Fix error path in timer_create Move CLOCK_DISPATCH(which_clock, timer_create, (new_timer)) after all posible EFAULT erros. *_timer_create may allocate/get resources. (for example posix_cpu_timer_create does get_task_struct) [ tglx: fold the remove crappy comment patch into this ] Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: <stable@kernel.org> Reviewed-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-27 22:38:15 +02:00
Linus Torvalds	00b9b0af58	Avoid warning when CPU hotplug isn't enabled Commit `e9fb7631eb` ("cpu-hotplug: introduce cpu_notify(), __cpu_notify(), cpu_notify_nofail()") also introduced this annoying warning: kernel/cpu.c:157: warning: 'cpu_notify_nofail' defined but not used when CONFIG_HOTPLUG_CPU wasn't set. So move that helper inside the #ifdef CONFIG_HOTPLUG_CPU region, and simplify it while at it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 10:32:08 -07:00
Lee Schermerhorn	3dd6b5fb43	numa: in-kernel profiling: use cpu_to_mem() for per cpu allocations In kernel profiling requires that we be able to allocate "local" memory for each cpu. Use "cpu_to_mem()" instead of "cpu_to_node()" to support memoryless nodes. Depends on the "numa_mem_id()" patch. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:57 -07:00
Anton Blanchard	5b530fc183	panic: call console_verbose() in panic Most distros turn the console verbosity down and that means a backtrace after a panic never makes it to the console. I assume we haven't seen this because a panic is often preceeded by an oops which will have called console_verbose. There are however a lot of places we call panic directly, and they are broken. Use console_verbose like we do in the oops path to ensure a directly called panic will print a backtrace. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:53 -07:00
Oleg Nesterov	f106eee100	pids: fix fork_idle() to setup ->pids correctly copy_process(pid => &init_struct_pid) doesn't do attach_pid/etc. It shouldn't, but this means that the idle threads run with the wrong pids copied from the caller's task_struct. In x86 case the caller is either kernel_init() thread or keventd. In particular, this means that after the series of cpu_up/cpu_down an idle thread (which never exits) can run with .pid pointing to nowhere. Change fork_idle() to initialize idle->pids[] correctly. We only set .pid = &init_struct_pid but do not add .node to list, INIT_TASK() does the same for the boot-cpu idle thread (swapper). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Mathias Krause <Mathias.Krause@secunet.com> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:52 -07:00
Hedi Berriche	72680a191b	pids: increase pid_max based on num_possible_cpus On a system with a substantial number of processors, the early default pid_max of 32k will not be enough. A system with 1664 CPU's, there are 25163 processes started before the login prompt. It's estimated that with 2048 CPU's we will pass the 32k limit. With 4096, we'll reach that limit very early during the boot cycle, and processes would stall waiting for an available pid. This patch increases the early maximum number of pids available, and increases the minimum number of pids that can be set during runtime. [akpm@linux-foundation.org: fix warnings] Signed-off-by: Hedi Berriche <hedi@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Robin Holt <holt@sgi.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pavel Machek <pavel@ucw.cz> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Greg KH <gregkh@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: John Stoffel <john@stoffel.org> Cc: Jack Steiner <steiner@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:51 -07:00
Lai Jiangshan	79a6cdeb7e	cpuhotplug: do not need cpu_hotplug_begin() when CONFIG_HOTPLUG_CPU=n Since when CONFIG_HOTPLUG_CPU=n, get_online_cpus() do nothing, so we don't need cpu_hotplug_begin() either. This patch moves cpu_hotplug_begin()/cpu_hotplug_done() into the code block of CONFIG_HOTPLUG_CPU=y. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:48 -07:00
Akinobu Mita	80b5184cc5	kernel/: convert cpu notifier to return encapsulate errno value By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for kernel/*.c Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:48 -07:00
Akinobu Mita	e6bde73b07	cpu-hotplug: return better errno on cpu hotplug failure Currently, onlining or offlining a CPU failure by one of the cpu notifiers error always cause -EINVAL error. (i.e. writing 0 or 1 to /sys/devices/system/cpu/cpuX/online gets EINVAL) To get better error reporting rather than always getting -EINVAL, This changes cpu_notify() to return -errno value with notifier_to_errno() and fix the callers. Now that cpu notifiers can return encapsulate errno value. Currently, all cpu hotplug notifiers return NOTIFY_OK, NOTIFY_BAD, or NOTIFY_DONE. So cpu_notify() can returns 0 or -EPERM with this change for now. (notifier_to_errno(NOTIFY_OK) == 0, notifier_to_errno(NOTIFY_DONE) == 0, notifier_to_errno(NOTIFY_BAD) == -EPERM) Forthcoming patches convert several cpu notifiers to return encapsulate errno value with notifier_from_errno(). Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Akinobu Mita	e9fb7631eb	cpu-hotplug: introduce cpu_notify(), __cpu_notify(), cpu_notify_nofail() No functional change. These are just wrappers of raw_cpu_notifier_call_chain. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Oleg Nesterov	b3ac022cb9	proc: turn signal_struct->count into "int nr_threads" No functional changes, just s/atomic_t count/int nr_threads/. With the recent changes this counter has a single user, get_nr_threads() And, none of its callers need the really accurate number of threads, not to mention each caller obviously races with fork/exit. It is only used to report this value to the user-space, except first_tid() uses it to avoid the unnecessary while_each_thread() loop in the unlikely case. It is a bit sad we need a word in struct signal_struct for this, perhaps we can change get_nr_threads() to approximate the number of threads using signal->live and kill ->nr_threads later. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Oleg Nesterov	5089a97680	proc_sched_show_task(): use get_nr_threads() Trivial, use get_nr_threads() helper to read signal->count which we are going to change. Like other callers, proc_sched_show_task() doesn't need the exactly precise nr_threads. David said: : Note that get_nr_threads() isn't completely equivalent (it can return 0 : where proc_sched_show_task() will display a 1). But I don't think this : should be a problem. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Oleg Nesterov	6e1be45aa6	check_unshare_flags: kill the bogus CLONE_SIGHAND/sig->count check check_unshare_flags(CLONE_SIGHAND) adds CLONE_THREAD to *flags_ptr if the task is multithreaded to ensure unshare_thread() will fail. Not only this is a bit strange way to return the error, this is absolutely meaningless. If signal->count > 1 then sighand->count must be also > 1, and unshare_sighand() will fail anyway. In fact, all CLONE_THREAD/SIGHAND/VM checks inside sys_unshare() do not look right. Fortunately this code doesn't really work anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Oleg Nesterov	97101eb41d	exit: move taskstats_tgid_free() from __exit_signal() to free_signal_struct() Move taskstats_tgid_free() from __exit_signal() to free_signal_struct(). This way signal->stats never points to nowhere and we can read ->stats lockless. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	a705be6b5e	kill the obsolete thread_group_cputime_free() helper Kill the empty thread_group_cputime_free() helper. It was needed to free the per-cpu data which we no longer have. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	d40e48e02f	exit: __exit_signal: use thread_group_leader() consistently Cleanup: - Add the boolean, group_dead = thread_group_leader(), for clarity. - Do not test/set sig == NULL to detect the all-dead case, use this boolean. - Pass this boolen to __unhash_process() and use it instead of another thread_group_leader() call which needs ->group_leader. This can be considered as microoptimization, but hopefully this also allows us do do other cleanups later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	b7b8ff6373	signals: kill the awful task_rq_unlock_wait() hack Now that task->signal can't go away we can revert the horrible hack added by `ad474caca3` ("fix for account_group_exec_runtime(), make sure ->signal can't be freed under rq->lock"). And we can do more cleanups sched_stats.h/posix-cpu-timers.c later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	4ada856fb0	signals: clear signal->tty when the last thread exits When the last thread exits signal->tty is freed, but the pointer is not cleared and points to nowhere. This is OK. Nobody should use signal->tty lockless, and it is no longer possible to take ->siglock. However this looks wrong even if correct, and the nice OOPS is better than subtle and hard to find bugs. Change __exit_signal() to clear signal->tty under ->siglock. Note: __exit_signal() needs more cleanups. It should not check "sig != NULL" to detect the all-dead case and we have the same issues with signal->stats. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	ea6d290ca3	signals: make task_struct->signal immutable/refcountable We have a lot of problems with accessing task_struct->signal, it can "disappear" at any moment. Even current can't use its ->signal safely after exit_notify(). ->siglock helps, but it is not convenient, not always possible, and sometimes it makes sense to use task->signal even after this task has already dead. This patch adds the reference counter, sigcnt, into signal_struct. This reference is owned by task_struct and it is dropped in __put_task_struct(). Perhaps it makes sense to export get/put_signal_struct() later, but currently I don't see the immediate reason. Rename __cleanup_signal() to free_signal_struct() and unexport it. With the previous changes it does nothing except kmem_cache_free(). Change __exit_signal() to not clear/free ->signal, it will be freed when the last reference to any thread in the thread group goes away. Note: - when the last thead exits signal->tty can point to nowhere, see the next patch. - with or without this patch signal_struct->count should go away, or at least it should be "int nr_threads" for fs/proc. This will be addressed later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	4dec2a91fd	fork/exit: move tty_kref_put() outside of __cleanup_signal() tty_kref_put() has two callsites in copy_process() paths, 1. if copy_process() suceeds it is called before we copy signal->tty from parent 2. otherwise it is called from __cleanup_signal() under bad_fork_cleanup_signal: label In both cases tty_kref_put() is not right and unneeded because we don't have the balancing tty_kref_get(). Fortunately, this is harmless because this can only happen without CLONE_THREAD, and in this case signal->tty must be NULL. Remove tty_kref_put() from copy_process() and __cleanup_signal(), and change another caller of __cleanup_signal(), __exit_signal(), to call tty_kref_put() by hand. I hope this change makes sense by itself, but it is also needed to make ->signal refcountable. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Alan Cox <alan@linux.intel.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	d30fda3551	posix-cpu-timers: avoid "task->signal != NULL" checks Preparation to make task->signal immutable, no functional changes. posix-cpu-timers.c checks task->signal != NULL to ensure this task is alive and didn't pass __exit_signal(). This is correct but we are going to change the lifetime rules for ->signal and never reset this pointer. Change the code to check ->sighand instead, it doesn't matter which pointer we check under tasklist, they both are cleared simultaneously. As Roland pointed out, some of these changes are not strictly needed and probably it makes sense to revert them later, when ->signal will be pinned to task_struct. But this patch tries to ensure the subsequent changes in fork/exit can't make any visible impact on posix cpu timers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	4a59994297	exit: avoid sig->count in __exit_signal() to detect the group-dead case Change __exit_signal() to check thread_group_leader() instead of atomic_dec_and_test(&sig->count). This must be equivalent, the group leader must be released only after all other threads have exited and passed __exit_signal(). Henceforth sig->count is not actually used, except in fs/proc for get_nr_threads/etc. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	d344193a05	exit: avoid sig->count in de_thread/__exit_signal synchronization de_thread() and __exit_signal() use signal_struct->count/notify_count for synchronization. We can simplify the code and use ->notify_count only. Instead of comparing these two counters, we can change de_thread() to set ->notify_count = nr_of_sub_threads, then change __exit_signal() to dec-and-test this counter and notify group_exit_task. Note that __exit_signal() checks "notify_count > 0" just for symmetry with exit_notify(), we could just check it is != 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	09faef11df	exit: change zap_other_threads() to count sub-threads Change zap_other_threads() to return the number of other sub-threads found on ->thread_group list. Other changes are cosmetic: - change the code to use while_each_thread() helper - remove the obsolete comment about SIGKILL/SIGSTOP Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	9c33916844	exit: exit_notify() can trust signal->notify_count < 0 signal_struct->count in its current form must die. - it has no reasons to be atomic_t - it looks like a reference counter, but it is not - otoh, we really need to make task->signal refcountable, just look at the extremely ugly task_rq_unlock_wait() called from __exit_signals(). - we should change the lifetime rules for task->signal, it should be pinned to task_struct. We have a lot of code which can be simplified after that. - it is not needed! while the code is correct, any usage of this counter is artificial, except fs/proc uses it correctly to show the number of threads. This series removes the usage of sig->count from exit pathes. This patch: Now that Veaceslav changed copy_signal() to use zalloc(), exit_notify() can just check notify_count < 0 to ensure the execing sub-threads needs the notification from us. No need to do other checks, notify_count != 0 must always mean ->group_exit_task != NULL is waiting for us. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	04b1c384fb	call_usermodehelper: UMH_WAIT_EXEC ignores kernel_thread() failure UMH_WAIT_EXEC should report the error if kernel_thread() fails, like UMH_WAIT_PROC does. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	d47419cd96	call_usermodehelper: simplify/fix UMH_NO_WAIT case __call_usermodehelper(UMH_NO_WAIT) has 2 problems: - if kernel_thread() fails, call_usermodehelper_freeinfo() is not called. - for unknown reason UMH_NO_WAIT has UMH_WAIT_PROC logic, we spawn yet another thread which waits until the user mode application exits. Change the UMH_NO_WAIT code to use ____call_usermodehelper() instead of wait_for_helper(), and do call_usermodehelper_freeinfo() unconditionally. We can rely on CLONE_VFORK, do_fork(CLONE_VFORK) until the child exits or execs. With or without this patch UMH_NO_WAIT does not report the error if kernel_thread() fails, this is correct since the caller doesn't wait for result. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	7d64224217	wait_for_helper: SIGCHLD from user-space can lead to use-after-free 1. wait_for_helper() calls allow_signal(SIGCHLD) to ensure the child can't autoreap itself. However, this means that a spurious SIGCHILD from user-space can set TIF_SIGPENDING and: - kernel_thread() or sys_wait4() can fail due to signal_pending() - worse, wait4() can fail before ____call_usermodehelper() execs or exits. In this case the caller may kfree(subprocess_info) while the child still uses this memory. Change the code to use SIG_DFL instead of magic "(void __user *)2" set by allow_signal(). This means that SIGCHLD won't be delivered, yet the child won't autoreap itsefl. The problem is minor, only root can send a signal to this kthread. 2. If sys_wait4(&ret) fails it doesn't populate "ret", in this case wait_for_helper() reports a random value from uninitialized var. With this patch sys_wait4() should never fail, but still it makes sense to initialize ret = -ECHILD so that the caller can notice the problem. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	363da4022c	call_usermodehelper: no need to unblock signals ____call_usermodehelper() correctly calls flush_signal_handlers() to set SIG_DFL, but sigemptyset(->blocked) and recalc_sigpending() are not needed. This kthread was forked by workqueue thread, all signals must be unblocked and ignored, no pending signal is possible. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	c70a626d3e	umh: creds: kill subprocess_info->cred logic Now that nobody ever changes subprocess_info->cred we can kill this member and related code. ____call_usermodehelper() always runs in the context of freshly forked kernel thread, it has the proper ->cred copied from its parent kthread, keventd. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	685bfd2c48	umh: creds: convert call_usermodehelper_keys() to use subprocess_info->init() call_usermodehelper_keys() uses call_usermodehelper_setkeys() to change subprocess_info->cred in advance. Now that we have info->init() we can change this code to set tgcred->session_keyring in context of execing kernel thread. Note: since currently call_usermodehelper_keys() is never called with UMH_NO_WAIT, call_usermodehelper_keys()->key_get() and umh_keys_cleanup() are not really needed, we could rely on install_session_keyring_to_cred() which does key_get() on success. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Neil Horman	898b374af6	exec: replace call_usermodehelper_pipe with use of umh init function and resolve limit The first patch in this series introduced an init function to the call_usermodehelper api so that processes could be customized by caller. This patch takes advantage of that fact, by customizing the helper in do_coredump to create the pipe and set its core limit to one (for our recusrsion check). This lets us clean up the previous uglyness in the usermodehelper internals and factor call_usermodehelper out entirely. While I'm at it, we can also modify the helper setup to look for a core limit value of 1 rather than zero for our recursion check Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Neil Horman	a06a4dc3a0	kmod: add init function to usermodehelper About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe feature in the kernel works. We had reports of several races, including some reports of apps bypassing our recursion check so that a process that was forked as part of a core_pattern setup could infinitely crash and refork until the system crashed. We fixed those by improving our recursion checks. The new check basically refuses to fork a process if its core limit is zero, which works well. Unfortunately, I've been getting grief from maintainer of user space programs that are inserted as the forked process of core_pattern. They contend that in order for their programs (such as abrt and apport) to work, all the running processes in a system must have their core limits set to a non-zero value, to which I say 'yes'. I did this by design, and think thats the right way to do things. But I've been asked to ease this burden on user space enough times that I thought I would take a look at it. The first suggestion was to make the recursion check fail on a non-zero 'special' number, like one. That way the core collector process could set its core size ulimit to 1, and enable the kernel's recursion detection. This isn't a bad idea on the surface, but I don't like it since its opt-in, in that if a program like abrt or apport has a bug and fails to set such a core limit, we're left with a recursively crashing system again. So I've come up with this. What I've done is modify the call_usermodehelper api such that an extra parameter is added, a function pointer which will be called by the user helper task, after it forks, but before it exec's the required process. This will give the caller the opportunity to get a call back in the processes context, allowing it to do whatever it needs to to the process in the kernel prior to exec-ing the user space code. In the case of do_coredump, this callback is ues to set the core ulimit of the helper process to 1. This elimnates the opt-in problem that I had above, as it allows the ulimit for core sizes to be set to the value of 1, which is what the recursion check looks for in do_coredump. This patch: Create new function call_usermodehelper_fns() and allow it to assign both an init and cleanup function, as we'll as arbitrary data. The init function is called from the context of the forked process and allows for customization of the helper process prior to calling exec. Its return code gates the continuation of the process, or causes its exit. Also add an arbitrary data pointer to the subprocess_info struct allowing for data to be passed from the caller to the new process, and the subsequent cleanup process Also, use this patch to cleanup the cleanup function. It currently takes an argp and envp pointer for freeing, which is ugly. Lets instead just make the subprocess_info structure public, and pass that to the cleanup and init routines Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Oleg Nesterov	065add3941	signals: check_kill_permission(): don't check creds if same_thread_group() Andrew Tridgell reports that aio_read(SIGEV_SIGNAL) can fail if the notification from the helper thread races with setresuid(), see http://samba.org/~tridge/junkcode/aio_uid.c This happens because check_kill_permission() doesn't permit sending a signal to the task with the different cred->xids. But there is not any security reason to check ->cred's when the task sends a signal (private or group-wide) to its sub-thread. Whatever we do, any thread can bypass all security checks and send SIGKILL to all threads, or it can block a signal SIG and do kill(gettid(), SIG) to deliver this signal to another sub-thread. Not to mention that CLONE_THREAD implies CLONE_VM. Change check_kill_permission() to avoid the credentials check when the sender and the target are from the same thread group. Also, move "cred = current_cred()" down to avoid calling get_current() twice. Note: David Howells pointed out we could relax this even more, the CLONE_SIGHAND (without CLONE_THREAD) case probably does not need these checks too. Roland said: : The glibc (libpthread) that does set*id across threads has : been in use for a while (2.3.4?), probably in distro's using kernels as old : or older than any active -stable streams. In the race in question, this : kernel bug is breaking valid POSIX application expectations. Reported-by: Andrew Tridgell <tridge@samba.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Eric Paris <eparis@parisplace.org> Cc: Jakub Jelinek <jakub@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Roland McGrath <roland@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: <stable@kernel.org> [all kernel versions] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Oleg Nesterov	e0129ef91e	ptrace: PTRACE_GETFDPIC: fix the unsafe usage of child->mm Now that Mike Frysinger unified the FDPIC ptrace code, we can fix the unsafe usage of child->mm in ptrace_request(PTRACE_GETFDPIC). We have the reference to task_struct, and ptrace_check_attach() verified the tracee is stopped. But nothing can protect from SIGKILL after that, we must not assume child->mm != NULL. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Greg Ungerer <gerg@snapgear.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Mike Frysinger	9c1a125921	ptrace: unify FDPIC implementations The Blackfin/FRV/SuperH guys all have the same exact FDPIC ptrace code in their arch handlers (since they were probably copied & pasted). Since these ptrace interfaces are an arch independent aspect of the FDPIC code, unify them in the common ptrace code so new FDPIC ports don't need to copy and paste this fundamental stuff yet again. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Jack Steiner	0ac0c0d0f8	cpusets: randomize node rotor used in cpuset_mem_spread_node() Some workloads that create a large number of small files tend to assign too many pages to node 0 (multi-node systems). Part of the reason is that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at node 0 for newly created tasks. This patch changes the rotor to be initialized to a random node number of the cpuset. [akpm@linux-foundation.org: fix layout] [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration] Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Paul Menage <menage@google.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Jack Steiner	6adef3ebe5	cpusets: new round-robin rotor for SLAB allocations We have observed several workloads running on multi-node systems where memory is assigned unevenly across the nodes in the system. There are numerous reasons for this but one is the round-robin rotor in cpuset_mem_spread_node(). For example, a simple test that writes a multi-page file will allocate pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it allocates on odd nodes & skips even nodes). An example is shown below. The program "lfile" writes a file consisting of 10 pages. The program then mmaps the file & uses get_mempolicy(..., MPOL_F_NODE) to determine the nodes where the file pages were allocated. The output is shown below: # ./lfile allocated on nodes: 2 4 6 0 1 2 6 0 2 There is a single rotor that is used for allocating both file pages & slab pages. Writing the file allocates both a data page & a slab page (buffer_head). This advances the RR rotor 2 nodes for each page allocated. A quick confirmation seems to confirm this is the cause of the uneven allocation: # echo 0 >/dev/cpuset/memory_spread_slab # ./lfile allocated on nodes: 6 7 8 9 0 1 2 3 4 5 This patch introduces a second rotor that is used for slab allocations. Signed-off-by: Jack Steiner <steiner@sgi.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Paul Menage <menage@google.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Kirill A. Shutemov	907860ed38	cgroups: make cftype.unregister_event() void-returning Since we are unable to handle an error returned by cftype.unregister_event() properly, let's make the callback void-returning. mem_cgroup_unregister_event() has been rewritten to be a "never fail" function. On mem_cgroup_usage_register_event() we save old buffer for thresholds array and reuse it in mem_cgroup_usage_unregister_event() to avoid allocation. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Phil Carmody <ext-phil.2.carmody@nokia.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Stanislaw Gruszka	174bd1994e	hrtimer: Avoid double seqlock hrtimer_get_softirq_time() has it's own xtime lock protection, so it's safe to use plain __current_kernel_time() and avoid the double seqlock loop. Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl> LKML-Reference: <20100525214912.GA1934@r2bh72.net.upc.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-26 16:15:37 +02:00
Thomas Gleixner	2abfb9e1d4	timers: Move local variable into else section Fix nit-picking coding style detail. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-26 16:07:13 +02:00
Linus Torvalds	b1cdc4670b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (63 commits) drivers/net/usb/asix.c: Fix pointer cast. be2net: Bug fix to avoid disabling bottom half during firmware upgrade. proc_dointvec: write a single value hso: add support for new products Phonet: fix potential use-after-free in pep_sock_close() ath9k: remove VEOL support for ad-hoc ath9k: change beacon allocation to prefer the first beacon slot sock.h: fix kernel-doc warning cls_cgroup: Fix build error when built-in macvlan: do proper cleanup in macvlan_common_newlink() V2 be2net: Bug fix in init code in probe net/dccp: expansion of error code size ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep wireless: fix sta_info.h kernel-doc warnings wireless: fix mac80211.h kernel-doc warnings iwlwifi: testing the wrong variable in iwl_add_bssid_station() ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs() ath9k_htc: dereferencing before check in hif_usb_tx_cb() rt2x00: Fix rt2800usb TX descriptor writing. rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions. ...	2010-05-25 16:59:51 -07:00
Linus Torvalds	218ce73514	Revert "module: drop the lock while waiting for module to complete initialization." This reverts commit `480b02df3a`, since Rafael reports that it causes occasional kernel paging request faults in load_module(). Dropping the module lock and re-taking it deep in the call-chain is definitely not the right thing to do. That just turns the mutex from a lock into a "random non-locking data structure" that doesn't actually protect what it's supposed to protect. Requested-and-tested-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Brandon Philips <brandon@ifup.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 16:48:30 -07:00
J. R. Okajima	563b046710	proc_dointvec: write a single value The commit `00b7c3395a` "sysctl: refactor integer handling proc code" modified the behaviour of writing to /proc. Before the commit, write("1\n") to /proc/sys/kernel/printk succeeded. But now it returns EINVAL. This commit supports writing a single value to a multi-valued entry. Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp> Reviewed-and-tested-by: WANG Cong <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-25 16:10:14 -07:00
Thomas Gleixner	8e63d7795e	timers: Fix slack calculation really commit `f00e047ef` (timers: Fix slack calculation for expired timers) fixed the issue of slack on expired timers only partially. Linus noticed that jiffies is volatile so it is reloaded twice, which generates bad code. But its worse. This can defeat the time_after() check if jiffies are incremented between time_after() and the slack calculation. Fix it by reading jiffies into a local variable, which prevents the compiler from loading it twice. While at it make the > -1 check into >= 0 which is easier to read. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 21:07:48 +02:00
Steven Rostedt	2711ca237a	ring-buffer: Move zeroing out excess in page to ring buffer code Currently the trace splice code zeros out the excess bytes in the page before sending it off to userspace. This is to make sure userspace is not getting anything it should not be when reading the pages, because the excess data was never initialized to zero before writing (for perfomance reasons). But the splice code has no business in doing this work, it should be done by the ring buffer. With the latest changes for recording lost events, the splice code gets it wrong anyway. Move the zeroing out of excess bytes into the ring buffer code. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-25 11:57:26 -04:00
Steven Rostedt	b3230c8b44	ring-buffer: Reset "real_end" when page is filled The code to store the "lost events" requires knowing the real end of the page. Since the 'commit' includes the padding at the end of a page a "real_end" variable was used to keep track of the end not including the padding. If events were lost, the reader can place the count of events in the padded area if there is enough room. The bug this patch fixes is that when we fill the page we do not reset the real_end variable, and if the writer had wrapped a few times, the real_end would be incorrect. This patch simply resets the real_end if the page was filled. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-25 11:57:24 -04:00
Andy Shevchenko	69e4469a39	sysctl: don't use own implementation of hex_to_bin() Remove own implementation of hex_to_bin(). Signed-off-by: Andy Shevchenko <ext-andriy.shevchenko@nokia.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:05 -07:00
Wenji Huang	7d52669b14	module: remove duplicate declaration of __ksymtab_gpl_future Minor cleanup on duplicate __{start/stop}__ksymtab_gpl_future. Signed-off-by: Wenji Huang <wenji.huang@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:04 -07:00
Haicheng Li	4eaf3f6439	mem-hotplug: fix potential race while building zonelist for new populated zone Add global mutex zonelists_mutex to fix the possible race: CPU0 CPU1 CPU2 (1) zone->present_pages += online_pages; (2) build_all_zonelists(); (3) alloc_page(); (4) free_page(); (5) build_all_zonelists(); (6) __build_all_zonelists(); (7) zone->pageset = alloc_percpu(); In step (3,4), zone->pageset still points to boot_pageset, so bad things may happen if 2+ nodes are in this state. Even if only 1 node is accessing the boot_pageset, (3) may still consume too much memory to fail the memory allocations in step (7). Besides, atomic operation ensures alloc_percpu() in step (7) will never fail since there is a new fresh memory block added in step(6). [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists] Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Andi Kleen <andi.kleen@intel.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:02 -07:00
Haicheng Li	1f522509c7	mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset For each new populated zone of hotadded node, need to update its pagesets with dynamically allocated per_cpu_pageset struct for all possible CPUs: 1) Detach zone->pageset from the shared boot_pageset at end of __build_all_zonelists(). 2) Use mutex to protect zone->pageset when it's still shared in onlined_pages() Otherwises, multiple zones of different nodes would share same boot strapping boot_pageset for same CPU, which will finally cause below kernel panic: ------------[ cut here ]------------ kernel BUG at mm/page_alloc.c:1239! invalid opcode: 0000 [#1] SMP ... Call Trace: [<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0 [<ffffffff81162e67>] alloc_pages_current+0x87/0xd0 [<ffffffff81128407>] __page_cache_alloc+0x67/0x70 [<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260 [<ffffffff81132751>] ra_submit+0x21/0x30 [<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0 [<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0 [<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670 [<ffffffff81266cfa>] nfs_file_read+0xca/0x130 [<ffffffff8117b20a>] do_sync_read+0xfa/0x140 [<ffffffff8117bf75>] vfs_read+0xb5/0x1a0 [<ffffffff8117c151>] sys_read+0x51/0x80 [<ffffffff8103c032>] system_call_fastpath+0x16/0x1b RIP [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900 RSP <ffff88000d1e78a8> ---[ end trace 4bda28328b9990db ] [akpm@linux-foundation.org: merge fix] Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Andi Kleen <andi.kleen@intel.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:01 -07:00
minskey guo	cf23422b9d	cpu/mem hotplug: enable CPUs online before local memory online Enable users to online CPUs even if the CPUs belongs to a numa node which doesn't have onlined local memory. The zonlists(pg_data_t.node_zonelists[]) of a numa node are created either in system boot/init period, or at the time of local memory online. For a numa node without onlined local memory, its zonelists are not initialized at present. As a result, any memory allocation operations executed by CPUs within this node will fail. In fact, an out-of-memory error is triggered when attempt to online CPUs before memory comes to online. This patch tries to create zonelists for such numa nodes, so that the memory allocation for this node can be fallback'ed to other nodes. [akpm@linux-foundation.org: remove unneeded export] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: minskey guo<chaohong.guo@intel.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:00 -07:00
Mel Gorman	5e77190580	mm: compaction: add a tunable that decides when memory should be compacted and when it should be reclaimed The kernel applies some heuristics when deciding if memory should be compacted or reclaimed to satisfy a high-order allocation. One of these is based on the fragmentation. If the index is below 500, memory will not be compacted. This choice is arbitrary and not based on data. To help optimise the system and set a sensible default for this value, this patch adds a sysctl extfrag_threshold. The kernel will only compact memory if the fragmentation index is above the extfrag_threshold. [randy.dunlap@oracle.com: Fix build errors when proc fs is not configured] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:06:59 -07:00
Mel Gorman	76ab0f530e	mm: compaction: add /proc trigger for memory compaction Add a proc file /proc/sys/vm/compact_memory. When an arbitrary value is written to the file, all zones are compacted. The expected user of such a trigger is a job scheduler that prepares the system before the target application runs. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:06:59 -07:00
Miao Xie	c0ff7453bb	cpuset,mm: fix no node to alloc memory when changing cpuset's mems Before applying this patch, cpuset updates task->mems_allowed and mempolicy by setting all new bits in the nodemask first, and clearing all old unallowed bits later. But in the way, the allocator may find that there is no node to alloc memory. The reason is that cpuset rebinds the task's mempolicy, it cleans the nodes which the allocater can alloc pages on, for example: (mpol: mempolicy) task1 task1's mpol task2 alloc page 1 alloc on node0? NO 1 1 change mems from 1 to 0 1 rebind task1's mpol 0-1 set new bits 0 clear disallowed bits alloc on node1? NO 0 ... can't alloc page goto oom This patch fixes this problem by expanding the nodes range first(set newly allowed bits) and shrink it lazily(clear newly disallowed bits). So we use a variable to tell the write-side task that read-side task is reading nodemask, and the write-side task clears newly disallowed nodes after read-side task ends the current memory allocation. [akpm@linux-foundation.org: fix spello] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:06:57 -07:00
Miao Xie	708c1bbc9d	mempolicy: restructure rebinding-mempolicy functions Nick Piggin reported that the allocator may see an empty nodemask when changing cpuset's mems[1]. It happens only on the kernel that do not do atomic nodemask_t stores. (MAX_NUMNODES > BITS_PER_LONG) But I found that there is also a problem on the kernel that can do atomic nodemask_t stores. The problem is that the allocator can't find a node to alloc page when changing cpuset's mems though there is a lot of free memory. The reason is like this: (mpol: mempolicy) task1 task1's mpol task2 alloc page 1 alloc on node0? NO 1 1 change mems from 1 to 0 1 rebind task1's mpol 0-1 set new bits 0 clear disallowed bits alloc on node1? NO 0 ... can't alloc page goto oom I can use the attached program reproduce it by the following step: # mkdir /dev/cpuset # mount -t cpuset cpuset /dev/cpuset # mkdir /dev/cpuset/1 # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems # echo $$ > /dev/cpuset/1/tasks # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> & <nr_tasks> = max(nr_cpus - 1, 1) # killall -s SIGUSR1 cpuset_mem_hog # ./change_mems.sh several hours later, oom will happen though there is a lot of free memory. This patchset fixes this problem by expanding the nodes range first(set newly allowed bits) and shrink it lazily(clear newly disallowed bits). So we use a variable to tell the write-side task that read-side task is reading nodemask, and the write-side task clears newly disallowed nodes after read-side task ends the current memory allocation. This patch: In order to fix no node to alloc memory, when we want to update mempolicy and mems_allowed, we expand the set of nodes first (set all the newly nodes) and shrink the set of nodes lazily(clean disallowed nodes), But the mempolicy's rebind functions may breaks the expanding. So we restructure the mempolicy's rebind functions and split the rebind work to two steps, just like the update of cpuset's mems: The 1st step: expand the set of the mempolicy's nodes. The 2nd step: shrink the set of the mempolicy's nodes. It is used when there is no real lock to protect the mempolicy in the read-side. Otherwise we can do rebind work at once. In order to implement it, we define enum mpol_rebind_step { MPOL_REBIND_ONCE, MPOL_REBIND_STEP1, MPOL_REBIND_STEP2, MPOL_REBIND_NSTEP, }; If the mempolicy needn't be updated by two steps, we can pass MPOL_REBIND_ONCE to the rebind functions. Or we can pass MPOL_REBIND_STEP1 to do the first step of the rebind work and pass MPOL_REBIND_STEP2 to do the second step work. Besides that, it maybe long time between these two step and we have to release the lock that protects mempolicy and mems_allowed. If we hold the lock once again, we must check whether the current mempolicy is under the rebinding (the first step has been done) or not, because the task may alloc a new mempolicy when we don't hold the lock. So we defined the following flag to identify it: #define MPOL_F_REBINDING (1 << 2) The new functions will be used in the next patch. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:06:57 -07:00
Peter Zijlstra	87f44bbc24	perf, trace: Fix !x86 build bug Patch `b7e2ecef92` (perf, trace: Optimize tracepoints by removing IRQ-disable from perf/tracepoint interaction) made the unfortunate mistake of assuming the world is x86 only, correct this. The problem was that perf_fetch_caller_regs() did local_save_flags() into regs->flags, and I re-used that to remove another local_save_flags(), forgetting !x86 doesn't have regs->flags. Do the reverse, remove the local_save_flags() from perf_fetch_caller_regs() and let the ftrace site do the local_save_flags() instead. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Cc: acme@redhat.com Cc: efault@gmx.de Cc: fweisbec@gmail.com Cc: rostedt@goodmis.org LKML-Reference: <1274778175.5882.623.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-25 11:28:49 +02:00
Jeff Chua	f00e047efd	timers: Fix slack calculation for expired timers commit `3bbb9ec946` (timers: Introduce the concept of timer slack for legacy timers) does not take the case into account when the timer is already expired. This broke wireless drivers. The solution is not to apply slack to already expired timers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@linux.intel.com>	2010-05-24 12:10:23 +02:00
Thomas Gleixner	bd45b7a385	timekeeping: Fix timezone update commit `64ce4c2f` (time: Clean up warp_clock()) breaks the timezone update in a very subtle way. To avoid the direct access to timekeeping internals it adds the timezone delta to the current time with timespec_add_safe(). This works nicely when the timezone delta is > 0. If timezone delta is < 0 then the wrap check in timespec_add_safe() triggers and timespec_add_safe() returns TIME_MAX and screws up timekeeping completely. The comment above timespec_add_safe() says: It's assumed that both values are valid (>= 0) Add the timezone seconds adjustment directly. Reported-by: Rafael J. Wysocki <rjw@sisk.pl> Tested-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-24 11:50:38 +02:00
Linus Torvalds	6109e2ce26	Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (36 commits) PCI: hotplug: pciehp: Removed check for hotplug of display devices PCI: read memory ranges out of Broadcom CNB20LE host bridge PCI: Allow manual resource allocation for PCI hotplug bridges x86/PCI: make ACPI MCFG reserved error messages ACPI specific PCI hotplug: Use kmemdup PM/PCI: Update PCI power management documentation PCI: output FW warning in pci_read/write_vpd PCI: fix typos pci_device_dis/enable to pci_dis/enable_device in comments PCI quirks: disable msi on AMD rs4xx internal gfx bridges PCI: Disable MSI for MCP55 on P5N32-E SLI x86/PCI: irq and pci_ids patch for additional Intel Cougar Point DeviceIDs PCI: aerdrv: trivial cleanup for aerdrv_core.c PCI: aerdrv: trivial cleanup for aerdrv.c PCI: aerdrv: introduce default_downstream_reset_link PCI: aerdrv: rework find_aer_service PCI: aerdrv: remove is_downstream PCI: aerdrv: remove magical ROOT_ERR_STATUS_MASKS PCI: aerdrv: redefine PCI_ERR_ROOT_*_SRC PCI: aerdrv: rework do_recovery PCI: aerdrv: rework get_e_source() ...	2010-05-21 18:58:52 -07:00
Linus Torvalds	0961d6581c	Merge git://git.infradead.org/iommu-2.6 * git://git.infradead.org/iommu-2.6: intel-iommu: Set a more specific taint flag for invalid BIOS DMAR tables intel-iommu: Combine the BIOS DMAR table warning messages panic: Add taint flag TAINT_FIRMWARE_WORKAROUND ('I') panic: Allow warnings to set different taint flags intel-iommu: intel_iommu_map_range failed at very end of address space intel-iommu: errors with smaller iommu widths intel-iommu: Fix boot inside 64bit virtualbox with io-apic disabled intel-iommu: use physfn to search drhd for VF intel-iommu: Print out iommu seq_id intel-iommu: Don't complain that ACPI_DMAR_SCOPE_TYPE_IOAPIC is not supported intel-iommu: Avoid global flushes with caching mode. intel-iommu: Use correct domain ID when caching mode is enabled intel-iommu mistakenly uses offset_pfn when caching mode is enabled intel-iommu: use for_each_set_bit() intel-iommu: Fix section mismatch dmar_ir_support() uses dmar_tbl.	2010-05-21 17:25:01 -07:00
Linus Torvalds	a8251096b4	Merge branch 'modules' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * 'modules' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: module: drop the lock while waiting for module to complete initialization. MODULE_DEVICE_TABLE(isapnp, ...) does nothing hisax_fcpcipnp: fix broken isapnp device table. isapnp: move definitions to mod_devicetable.h so file2alias can reach them.	2010-05-21 17:15:44 -07:00
Linus Torvalds	6e80e8ed5e	Merge branch 'for-2.6.35' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.35' of git://git.kernel.dk/linux-2.6-block: (86 commits) pipe: set lower and upper limit on max pages in the pipe page array pipe: add support for shrinking and growing pipes drbd: This is now equivalent to drbd release 8.3.8rc1 drbd: Do not free p_uuid early, this is done in the exit code of the receiver drbd: Null pointer deref fix to the large "multi bio rewrite" drbd: Fix: Do not detach, if a bio with a barrier fails drbd: Ensure to not trigger late-new-UUID creation multiple times drbd: Do not Oops when C_STANDALONE when uuid gets generated writeback: fix mixed up arguments to bdi_start_writeback() writeback: fix problem with !CONFIG_BLOCK compilation block: improve automatic native capacity unlocking block: use struct parsed_partitions *state universally in partition check code block,ide: simplify bdops->set_capacity() to ->unlock_native_capacity() block: restart partition scan after resizing a device buffer: make invalidate_bdev() drain all percpu LRU add caches block: remove all rcu head initializations writeback: fixups for !dirty_writeback_centisecs writeback: bdi_writeback_task() must set task state before calling schedule() writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync drivers/block/drbd: Use kzalloc ...	2010-05-21 15:25:33 -07:00
Randy Dunlap	0fc377bd64	sysctl: fix kernel-doc notation and typos Fix kernel-doc warnings, kernel-doc special characters, and typos in recent kernel/sysctl.c additions. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Amerigo Wang <amwang@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-21 15:23:12 -07:00
Linus Torvalds	2a8ba8f032	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (46 commits) random: simplify fips mode crypto: authenc - Fix cryptlen calculation crypto: talitos - add support for sha224 crypto: talitos - add hash algorithms crypto: talitos - second prepare step for adding ahash algorithms crypto: talitos - prepare for adding ahash algorithms crypto: n2 - Add Niagara2 crypto driver crypto: skcipher - Add ablkcipher_walk interfaces crypto: testmgr - Add testing for async hashing and update/final crypto: tcrypt - Add speed tests for async hashing crypto: scatterwalk - Fix scatterwalk_done() test crypto: hifn_795x - Rename ablkcipher_walk to hifn_cipher_walk padata: Use get_online_cpus/put_online_cpus in padata_free padata: Add some code comments padata: Flush the padata queues actively padata: Use a timer to handle remaining objects in the reorder queues crypto: shash - Remove usage of CRYPTO_MINALIGN crypto: mv_cesa - Use resource_size crypto: omap - OMAP macros corrected padata: Use get_online_cpus/put_online_cpus ... Fix up conflicts in arch/arm/mach-omap2/devices.c	2010-05-21 14:46:51 -07:00
Jens Axboe	ee9a3607fb	Merge branch 'master' into for-2.6.35 Conflicts: fs/ext3/fsync.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-21 21:27:26 +02:00
Jens Axboe	b492e95be0	pipe: set lower and upper limit on max pages in the pipe page array We need at least two to guarantee proper POSIX behaviour, so never allow a smaller limit than that. Also expose a /proc/sys/fs/pipe-max-pages sysctl file that allows root to define a sane upper limit. Make it default to 16 times the default size, which is 16 pages. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-21 21:12:52 +02:00
Jens Axboe	35f3d14dbb	pipe: add support for shrinking and growing pipes This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for growing and shrinking the size of a pipe and adjusts pipe.c and splice.c (and relay and network splice) usage to work with these larger (or smaller) pipes. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-21 21:12:40 +02:00
Linus Torvalds	ac3ee84c60	Merge branch 'dbg-early-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'dbg-early-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: echi-dbgp: Add kernel debugger support for the usb debug port earlyprintk,vga,kdb: Fix \b and \r for earlyprintk=vga with kdb kgdboc: Add ekgdboc for early use of the kernel debugger x86,early dr regs,kgdb: Allow kernel debugger early dr register access x86,kgdb: Implement early hardware breakpoint debugging x86, kgdb, init: Add early and late debug states x86, kgdb: early trap init for early debug	2010-05-21 11:10:41 -07:00
Linus Torvalds	90b9a32d8f	Merge branch 'kdb-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'kdb-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: (25 commits) kdb,debug_core: Allow the debug core to receive a panic notification MAINTAINERS: update kgdb, kdb, and debug_core info debug_core,kdb: Allow the debug core to process a recursive debug entry printk,kdb: capture printk() when in kdb shell kgdboc,kdb: Allow kdb to work on a non open console port kgdb: Add the ability to schedule a breakpoint via a tasklet mips,kgdb: kdb low level trap catch and stack trace powerpc,kgdb: Introduce low level trap catching x86,kgdb: Add low level debug hook kgdb: remove post_primary_code references kgdb,docs: Update the kgdb docs to include kdb kgdboc,keyboard: Keyboard driver for kdb with kgdb kgdb: gdb "monitor" -> kdb passthrough sparc,sunzilog: Add console polling support for sunzilog serial driver sh,sh-sci: Use NO_POLL_CHAR in the SCIF polled console code kgdb,8250,pl011: Return immediately from console poll kgdb: core changes to support kdb kdb: core for kgdb back end (2 of 2) kdb: core for kgdb back end (1 of 2) kgdb,blackfin: Add in kgdb_arch_set_pc for blackfin ...	2010-05-21 11:08:05 -07:00
Chris Wright	2c3c8bea60	sysfs: add struct file* to bin_attr callbacks This allows bin_attr->read,write,mmap callbacks to check file specific data (such as inode owner) as part of any privilege validation. Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-05-21 09:37:31 -07:00
Peter Zijlstra	1704f47b50	lockdep: Add novalidate class for dev->mutex conversion The conversion of device->sem to device->mutex resulted in lockdep warnings. Create a novalidate class for now until the driver folks come up with separate classes. That way we have at least the basic mutex debugging coverage. Add a checkpatch error so the usage is reserved for device->mutex. [ tglx: checkpatch and compile fix for LOCKDEP=n ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-05-21 09:37:30 -07:00
NeilBrown	db1afffab0	kref: remove kref_set Of the three uses of kref_set in the kernel: One really should be kref_put as the code is letting go of a reference, Two really should be kref_init because the kref is being initialised. This suggests that making kref_set available encourages bad code. So fix the three uses and remove kref_set completely. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-05-21 09:37:29 -07:00
Steven Rostedt	ff5f149b6a	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into trace/tip/tracing/core-7 Conflicts: include/linux/ftrace_event.h include/trace/ftrace.h kernel/trace/trace_event_perf.c kernel/trace/trace_kprobe.c kernel/trace/trace_syscalls.c Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-21 11:49:57 -04:00
Linus Torvalds	33cf23b0a5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (182 commits) [SCSI] aacraid: add an ifdef'd device delete case instead of taking the device offline [SCSI] aacraid: prohibit access to array container space [SCSI] aacraid: add support for handling ATA pass-through commands. [SCSI] aacraid: expose physical devices for models with newer firmware [SCSI] aacraid: respond automatically to volumes added by config tool [SCSI] fcoe: fix fcoe module ref counting [SCSI] libfcoe: FIP Keep-Alive messages for VPorts are sent with incorrect port_id and wwn [SCSI] libfcoe: Fix incorrect MAC address clearing [SCSI] fcoe: fix a circular locking issue with rtnl and sysfs mutex [SCSI] libfc: Move the port_id into lport [SCSI] fcoe: move link speed checking into its own routine [SCSI] libfc: Remove extra pointer check [SCSI] libfc: Remove unused fc_get_host_port_type [SCSI] fcoe: fixes wrong error exit in fcoe_create [SCSI] libfc: set seq_id for incoming sequence [SCSI] qla2xxx: Updates to ISP82xx support. [SCSI] qla2xxx: Optionally disable target reset. [SCSI] qla2xxx: ensure flash operation and host reset via sg_reset are mutually exclusive [SCSI] qla2xxx: Silence bogus warning by gcc for wrap and did. [SCSI] qla2xxx: T10 DIF support added. ...	2010-05-21 07:19:18 -07:00
Peter Zijlstra	580d607cd6	perf: Optimize perf_tp_event_match() Since we know tracepoints come from kernel context, avoid conditionals that try and establish that very fact. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.904944001@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:38:00 +02:00
Peter Zijlstra	a94ffaaf55	perf: Remove more code from the fastpath Sanity checks cost instructions. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.852926930@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:59 +02:00
Peter Zijlstra	3cafa9fbb5	perf: Optimize the !vmalloc backed buffer Reduce code and data by using the knowledge that for !PERF_USE_VMALLOC data_order is always 0. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.795019386@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:59 +02:00
Peter Zijlstra	5d967a8be6	perf: Optimize perf_output_copy() Reduce the clutter in perf_output_copy() by keeping an interator in perf_output_handle. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.742809176@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:59 +02:00
Peter Zijlstra	adb8e118f2	perf: Fix wakeup storm for RO mmap()s RO mmap()s don't update the tail pointer, so comparing against it for determining the written data size doesn't really do any good. Keep track of when we last did a wakeup, and compare against that. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.684479310@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:58 +02:00
Peter Zijlstra	0f139300c9	perf: Ensure that IOC_OUTPUT isn't used to create multi-writer buffers Since we want to ensure buffers only have a single writer, we must avoid creating one with multiple. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.528215873@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:57 +02:00
Peter Zijlstra	1c024eca51	perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events Avoid the swevent hash-table by using per-tracepoint hlists. Also, avoid conditionals on the fast path by ordering with probe unregister so that we should never get on the callback path without the data being there. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.473188012@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:56 +02:00
Peter Zijlstra	b7e2ecef92	perf, trace: Optimize tracepoints by removing IRQ-disable from perf/tracepoint interaction Improves performance. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1274259525.5605.10352.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:56 +02:00
Linus Torvalds	7a9b149212	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6: (229 commits) USB: remove unused usb_buffer_alloc and usb_buffer_free macros usb: musb: update gfp/slab.h includes USB: ftdi_sio: fix legacy SIO-device header USB: kl5usb105: reimplement using generic framework USB: kl5usb105: minor clean ups USB: kl5usb105: fix memory leak USB: io_ti: use kfifo to implement write buffering USB: io_ti: remove unsused private counter USB: ti_usb: use kfifo to implement write buffering USB: ir-usb: fix incorrect write-buffer length USB: aircable: fix incorrect write-buffer length USB: safe_serial: straighten out read processing USB: safe_serial: reimplement read using generic framework USB: safe_serial: reimplement write using generic framework usb-storage: always print quirks USB: usb-storage: trivial debug improvements USB: oti6858: use port write fifo USB: oti6858: use kfifo to implement write buffering USB: cypress_m8: use kfifo to implement write buffering USB: cypress_m8: remove unused drain define ... Fix up conflicts (due to usb_buffer_alloc/free renaming) in drivers/input/tablet/acecad.c drivers/input/tablet/kbtab.c drivers/input/tablet/wacom_sys.c drivers/media/video/gspca/gspca.c sound/usb/usbaudio.c	2010-05-20 21:26:12 -07:00
Linus Torvalds	f8965467f3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1674 commits) qlcnic: adding co maintainer ixgbe: add support for active DA cables ixgbe: dcb, do not tag tc_prio_control frames ixgbe: fix ixgbe_tx_is_paused logic ixgbe: always enable vlan strip/insert when DCB is enabled ixgbe: remove some redundant code in setting FCoE FIP filter ixgbe: fix wrong offset to fc_frame_header in ixgbe_fcoe_ddp ixgbe: fix header len when unsplit packet overflows to data buffer ipv6: Never schedule DAD timer on dead address ipv6: Use POSTDAD state ipv6: Use state_lock to protect ifa state ipv6: Replace inet6_ifaddr->dead with state cxgb4: notify upper drivers if the device is already up when they load cxgb4: keep interrupts available when the ports are brought down cxgb4: fix initial addition of MAC address cnic: Return SPQ credit to bnx2x after ring setup and shutdown. cnic: Convert cnic_local_flags to atomic ops. can: Fix SJA1000 command register writes on SMP systems bridge: fix build for CONFIG_SYSFS disabled ARCNET: Limit com20020 PCI ID matches for SOHARD cards ... Fix up various conflicts with pcmcia tree drivers/net/ {pcmcia/3c589_cs.c, wireless/orinoco/orinoco_cs.c and wireless/orinoco/spectrum_cs.c} and feature removal (Documentation/feature-removal-schedule.txt). Also fix a non-content conflict due to pm_qos_requirement getting renamed in the PM tree (now pm_qos_request) in net/mac80211/scan.c	2010-05-20 21:04:44 -07:00
Jason Wessel	0b4b3827db	x86, kgdb, init: Add early and late debug states The kernel debugger can operate well before mm_init(), but the x86 hardware breakpoint code which uses the perf api requires that the kernel allocators are initialized. This means the kernel debug core needs to provide an optional arch specific call back to allow the initialization functions to run after the kernel has been further initialized. The kdb shell already had a similar restriction with an early initialization and late initialization. The kdb_init() was moved into the debug core's version of the late init which is called dbg_late_init(); CC: kgdb-bugreport@lists.sourceforge.net Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:29 -05:00
Jason Wessel	4402c153cb	kdb,debug_core: Allow the debug core to receive a panic notification It is highly desirable to trap into kdb on panic. The debug core will attempt to register as the first in line for the panic notifier. CC: Ingo Molnar <mingo@elte.hu> CC: Andrew Morton <akpm@linux-foundation.org> CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:28 -05:00
Jason Wessel	6d90634076	debug_core,kdb: Allow the debug core to process a recursive debug entry This allows kdb to debug a crash with in the kms code with a single level recursive re-entry. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:27 -05:00
Jason Wessel	d37d39ae3b	printk,kdb: capture printk() when in kdb shell Certain calls from the kdb shell will call out to printk(), and any of these calls should get vectored back to the kdb_printf() so that the kdb pager and processing can be used, as well as to properly channel I/O to the polled I/O devices. CC: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>	2010-05-20 21:04:27 -05:00
Jason Wessel	efe2f29e32	kgdboc,kdb: Allow kdb to work on a non open console port If kdb is open on a serial port that is not actually a console make sure to call the poll routines to emit and receive characters. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Martin Hicks <mort@sgi.com>	2010-05-20 21:04:26 -05:00
Jason Wessel	1cee5e35f1	kgdb: Add the ability to schedule a breakpoint via a tasklet Some kgdb I/O modules require the ability to create a breakpoint tasklet, such as kgdboc and external modules such as kgdboe. The breakpoint tasklet is used as an asynchronous entry point into the debugger which will have a different function scope than the current execution path where it might not be safe to have an inline breakpoint. This is true of some of the kgdb I/O drivers which share code with kgdb and rest of the kernel users. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:26 -05:00
Jason Wessel	f503b5ae53	x86,kgdb: Add low level debug hook The only way the debugger can handle a trap in inside rcu_lock, notify_die, or atomic_notifier_call_chain without a triple fault is to have a low level "first opportunity handler" in the int3 exception handler. Generally this will be something the vast majority of folks will not need, but for those who need it, it is added as a kernel .config option called KGDB_LOW_LEVEL_TRAP. CC: Ingo Molnar <mingo@elte.hu> CC: Thomas Gleixner <tglx@linutronix.de> CC: H. Peter Anvin <hpa@zytor.com> CC: x86@kernel.org Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:25 -05:00
Jason Wessel	98ec1878ca	kgdb: remove post_primary_code references Remove all the references to the kgdb_post_primary_code. This function serves no useful purpose because you can obtain the same information from the "struct kgdb_state *ks" from with in the debugger, if for some reason you want the data. Also remove the unintentional duplicate assignment for ks->ex_vector. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:25 -05:00
Jason Wessel	ada64e4c98	kgdboc,keyboard: Keyboard driver for kdb with kgdb This patch adds in the kdb PS/2 keyboard driver. This was mostly a direct port from the original kdb where I cleaned up the code against checkpatch.pl and added the glue to stitch it into kgdb. This patch also enables early kdb debug via kgdbwait and the keyboard. All the access to configure kdb using either a serial console or the keyboard is done via kgdboc. If you want to use only the keyboard and want to break in early you would add to your kernel command arguments: kgdboc=kbd kgdbwait If you wanted serial and or the keyboard access you could use: kgdboc=kbd,ttyS0 You can also configure kgdboc as a kernel module or at run time with the sysfs where you can activate and deactivate kgdb. Turn it on: echo kbd,ttyS0 > /sys/module/kgdboc/parameters/kgdboc Turn it off: echo "" > /sys/module/kgdboc/parameters/kgdboc Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>	2010-05-20 21:04:24 -05:00
Jason Wessel	a0de055cf6	kgdb: gdb "monitor" -> kdb passthrough One of the driving forces behind integrating another front end (kdb) to the debug core is to allow front end commands to be accessible via gdb's monitor command. It is true that you could write gdb macros to get certain data, but you may want to just use gdb to access the commands that are available in the kdb front end. This patch implements the Rcmd gdb stub packet. In gdb you access this with the "monitor" command. For instance you could type "monitor help", "monitor lsmod" or "monitor ps A" etc... There is no error checking or command restrictions on what you can and cannot access at this point. Doing something like trying to set breakpoints with the monitor command is going to cause nothing but problems. Perhaps in the future only the commands that are actually known to work with the gdb monitor command will be available. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:24 -05:00
Jason Wessel	f5316b4aea	kgdb,8250,pl011: Return immediately from console poll The design of the kdb shell requires that every device that can provide input to kdb have a polling routine that exits immediately if there is no character available. This is required in order to get the page scrolling mechanism working. Changing the kernel debugger I/O API to require all polling character routines to exit immediately if there is no data allows the kernel debugger to process multiple input channels. NO_POLL_CHAR will be the return code to the polling routine when ever there is no character available. CC: linux-serial@vger.kernel.org Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:22 -05:00
Jason Wessel	dcc7871128	kgdb: core changes to support kdb These are the minimum changes to the kgdb core in order to enable an API to connect a new front end (kdb) to the debug core. This patch introduces the dbg_kdb_mode variable controls where the user level I/O is routed. It will be routed to the gdbstub (kgdb) or to the kdb front end which is a simple shell available over the kgdboc connection. You can switch back and forth between kdb or the gdb stub mode of operation dynamically. From gdb stub mode you can blindly type "$3#33", or from the kdb mode you can enter "kgdb" to switch to the gdb stub. The logic in the debug core depends on kdb to look for the typical gdb connection sequences and return immediately with KGDB_PASS_EVENT if a gdb serial command sequence is detected. That should allow a reasonably seamless transition between kdb -> gdb without leaving the kernel exception state. The two gdb serial queries that kdb is responsible for detecting are the "?" and "qSupported" packets. CC: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Martin Hicks <mort@sgi.com>	2010-05-20 21:04:21 -05:00
Jason Wessel	67fc4e0cb9	kdb: core for kgdb back end (2 of 2) This patch contains the hooks and instrumentation into kernel which live outside the kernel/debug directory, which the kdb core will call to run commands like lsmod, dmesg, bt etc... CC: linux-arch@vger.kernel.org Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Martin Hicks <mort@sgi.com>	2010-05-20 21:04:21 -05:00
Jason Wessel	5d5314d679	kdb: core for kgdb back end (1 of 2) This patch contains only the kdb core. Because the change set was large, it was split. The next patch in the series includes the instrumentation into the core kernel which are mainly helper functions for kdb. This work is directly derived from kdb v4.4 found at: ftp://oss.sgi.com/projects/kdb/download/v4.4/ The kdb internals have been re-organized to make them mostly platform independent and to connect everything to the debug core which is used by gdbstub (which has long been known as kgdb). The original version of kdb was 58,000 lines worth of changes to support x86. From that implementation only the kdb shell, and basic commands for memory access, runcontrol, lsmod, and dmesg where carried forward. This is a generic implementation which aims to cover all the current architectures using the kgdb core: ppc, arm, x86, mips, sparc, sh and blackfin. More archictectures can be added by implementing the architecture specific kgdb functions. [mort@sgi.com: Compile fix with hugepages enabled] [mort@sgi.com: Clean breakpoint code renaming kdba_ -> kdb_] [mort@sgi.com: fix new line after printing registers] [mort@sgi.com: Remove the concept of global vs. local breakpoints] [mort@sgi.com: Rework kdb_si_swapinfo to use more generic name] [mort@sgi.com: fix the information dump macros, remove 'arch' from the names] [sfr@canb.auug.org.au: include fixup to include linux/slab.h] CC: linux-arch@vger.kernel.org Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Martin Hicks <mort@sgi.com>	2010-05-20 21:04:20 -05:00
Jason Wessel	53197fc495	Separate the gdbstub from the debug core Split the former kernel/kgdb.c into debug_core.c which contains the kernel debugger exception logic and to the gdbstub.c which contains the logic for allowing gdb to talk to the debug core. This also created a private include file called debug_core.h which contains all the definitions to glue the debug_core to any other debugger connections. CC: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:19 -05:00
Jason Wessel	c433820971	Move kernel/kgdb.c to kernel/debug/debug_core.c Move kgdb.c in preparation to separate the gdbstub from the debug core and exception handling. CC: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:18 -05:00
Michal Nazarewicz	22c43c81a5	wait_event_interruptible_locked() interface New wait_event_interruptible{,_exclusive}_locked{,_irq} macros added. They work just like versions without _locked* suffix but require the wait queue's lock to be held. Also __wake_up_locked() is now exported as to pair it with the above macros. The use case of this new facility is when one uses wait queue's lock to protect a data structure. This may be advantageous if the structure needs to be protected by a spinlock anyway. In particular, with additional spinlock the following code has to be used to wait for a condition: spin_lock(&data.lock); ... for (ret = 0; !ret && !(condition); ) { spin_unlock(&data.lock); ret = wait_event_interruptible(data.wqh, (condition)); spin_lock(&data.lock); } ... spin_unlock(&data.lock); This looks bizarre plus wait_event_interruptible() locks the wait queue's lock anyway so there is a unlock+lock sequence where it could be avoided. To avoid those problems and benefit from wait queue's lock, a code similar to the following should be used: /* Waiting / spin_lock(&data.wqh.lock); ... ret = wait_event_interruptible_locked(data.wqh, (condition)); ... spin_unlock(&data.wqh.lock); / Waiting exclusively / spin_lock(&data.whq.lock); ... ret = wait_event_interruptible_exclusive_locked(data.whq, (condition)); ... spin_unlock(&data.whq.lock); / Waking up / spin_lock(&data.wqh.lock); ... wake_up_locked(&data.wqh); ... spin_unlock(&data.wqh.lock); When spin_lock_irq() is used matching versions of macros need to be used (_locked_irq()). Signed-off-by: Michal Nazarewicz <m.nazarewicz@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Takashi Iwai <tiwai@suse.de> Cc: David Howells <dhowells@redhat.com> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-05-20 13:21:42 -07:00
Frederic Weisbecker	acd35a463c	perf: Fix forgotten preempt_enable by nested writers A writer that gets a reference to the buffer handle disables preemption. When we put that reference, we check if we are the outer most writer and if not, we simply return and defer the head update to the outer most writer. The problem here is that preemption is only reenabled by the outer most, that produces preemption count imbalance for every nested writer that exit. So just don't forget to always re-enable preemption when we put the buffer reference, whoever we are. Fixes lots of sleeping in atomic warnings, visible with lock events recording. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: Robert Richter <robert.richter@amd.com>	2010-05-20 21:28:34 +02:00
Linus Torvalds	a0fe3cc5d3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (40 commits) Input: psmouse - small formatting changes to better follow coding style Input: synaptics - set dimensions as reported by firmware Input: elantech - relax signature checks Input: elantech - enforce common prefix on messages Input: wistron_btns - switch to using kmemdup() Input: usbtouchscreen - switch to using kmemdup() Input: do not force selecting i8042 on Moorestown Input: Documentation/sysrq.txt - update KEY_SYSRQ info Input: 88pm860x_onkey - remove invalid irq number assignment Input: i8042 - add a PNP entry to the aux device list Input: i8042 - add some extra PNP keyboard types Input: wm9712 - fix wm97xx_set_gpio() logic Input: add keypad driver for keys interfaced to TCA6416 Input: remove obsolete {corgi,spitz,tosa}kbd.c Input: kbtab - do not advertise unsupported events Input: kbtab - simplify kbtab_disconnect() Input: kbtab - fix incorrect size parameter in usb_buffer_free Input: acecad - don't advertise mouse events Input: acecad - fix some formatting issues Input: acecad - simplify usb_acecad_disconnect() ... Trivial conflict in Documentation/feature-removal-schedule.txt	2010-05-20 10:33:06 -07:00
Linus Torvalds	f39d01be4c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits) vlynq: make whole Kconfig-menu dependant on architecture add descriptive comment for TIF_MEMDIE task flag declaration. EEPROM: max6875: Header file cleanup EEPROM: 93cx6: Header file cleanup EEPROM: Header file cleanup agp: use NULL instead of 0 when pointer is needed rtc-v3020: make bitfield unsigned PCI: make bitfield unsigned jbd2: use NULL instead of 0 when pointer is needed cciss: fix shadows sparse warning doc: inode uses a mutex instead of a semaphore. uml: i386: Avoid redefinition of NR_syscalls fix "seperate" typos in comments cocbalt_lcdfb: correct sections doc: Change urls for sparse Powerpc: wii: Fix typo in comment i2o: cleanup some exit paths Documentation/: it's -> its where appropriate UML: Fix compiler warning due to missing task_struct declaration UML: add kernel.h include to signal.c ...	2010-05-20 09:20:59 -07:00
Linus Torvalds	46ee964509	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: PM QOS update fix Freezer / cgroup freezer: Update stale locking comments PM / platform_bus: Allow runtime PM by default i2c: Fix bus-level power management callbacks PM QOS update PM / Hibernate: Fix block_io.c printk warning PM / Hibernate: Group swap ops PM / Hibernate: Move the first_sector out of swsusp_write PM / Hibernate: Separate block_io PM / Hibernate: Snapshot cleanup FS / libfs: Implement simple_write_to_buffer PM / Hibernate: document open(/dev/snapshot) side effects PM / Runtime: Add sysfs debug files PM: Improve device power management document PM: Update device power management document PM: Allow runtime_suspend methods to call pm_schedule_suspend() PM: pm_wakeup - switch to using bool	2010-05-20 09:03:55 -07:00
Linus Torvalds	fa5312d9e8	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: change cancel_work_sync() to clear work->data workqueue: warn about flush_scheduled_work()	2010-05-20 09:03:02 -07:00
Linus Torvalds	96b5b7f4f2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (61 commits) KEYS: Return more accurate error codes LSM: Add __init to fixup function. TOMOYO: Add pathname grouping support. ima: remove ACPI dependency TPM: ACPI/PNP dependency removal security/selinux/ss: Use kstrdup TOMOYO: Use stack memory for pending entry. Revert "ima: remove ACPI dependency" Revert "TPM: ACPI/PNP dependency removal" KEYS: Do preallocation for __key_link() TOMOYO: Use mutex_lock_interruptible. KEYS: Better handling of errors from construct_alloc_key() KEYS: keyring_serialise_link_sem is only needed for keyring->keyring links TOMOYO: Use GFP_NOFS rather than GFP_KERNEL. ima: remove ACPI dependency TPM: ACPI/PNP dependency removal selinux: generalize disabling of execmem for plt-in-heap archs LSM Audit: rename LSM_AUDIT_NO_AUDIT to LSM_AUDIT_DATA_NONE CRED: Holding a spinlock does not imply the holding of RCU read lock SMACK: Don't #include Ext2 headers ...	2010-05-20 08:55:50 -07:00
Ingo Molnar	dfacc4d6c9	Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core	2010-05-20 14:38:55 +02:00
Frederic Weisbecker	49f135ed02	perf: Comply with new rcu checks API The software events hlist doesn't fully comply with the new rcu checks api. We need to consider three different sides that access the hlist: - the hlist allocation/release side. This side happens when an events is created or released, accesses to the hlist are serialized under the cpuctx mutex. - the events insertion/removal in the hlist. This side is always serialized against the above one. The hlist is always present during such operations. This side happens when a software event is scheduled in/out. The serialization that ensures the software event is really attached to the context is made under the ctx->lock. - events triggering. This is the read side, it can happen concurrently with any update side. This patch deals with them one by one and anticipates with the separate rcu mem space patches in preparation. This patch fixes various annoying rcu warnings. Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org>	2010-05-20 10:40:37 +02:00
Linus Torvalds	164d44fd92	Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: Add clocksource_register_hz/khz interface posix-cpu-timers: Optimize run_posix_cpu_timers() time: Remove xtime_cache mqueue: Convert message queue timeout to use hrtimers hrtimers: Provide schedule_hrtimeout for CLOCK_REALTIME timers: Introduce the concept of timer slack for legacy timers ntp: Remove tickadj ntp: Make time_adjust static time: Add xtime, wall_to_monotonic to feature-removal-schedule timer: Try to survive timer callback preempt_count leak timer: Split out timer function call timer: Print function name for timer callbacks modifying preemption count time: Clean up warp_clock() cpu-timers: Avoid iterating over all threads in fastpath_timer_check() cpu-timers: Change SIGEV_NONE timer implementation cpu-timers: Return correct previous timer reload value cpu-timers: Cleanup arm_timer() cpu-timers: Simplify RLIMIT_CPU handling	2010-05-19 17:11:10 -07:00
Linus Torvalds	6e0b7b2c39	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Clear CPU mask in affinity_hint when none is provided genirq: Add CPU mask affinity hint genirq: Remove IRQF_DISABLED from core code genirq: Run irq handlers with interrupts disabled genirq: Introduce request_any_context_irq() genirq: Expose irq_desc->node in proc/irq Fixed up trivial conflicts in Documentation/feature-removal-schedule.txt	2010-05-19 17:09:40 -07:00
KOSAKI Motohiro	fa9dc265ac	cpumask: fix compat getaffinity Commit `a45185d2d` "cpumask: convert kernel/compat.c" broke libnuma, which abuses sched_getaffinity to find out NR_CPUS in order to parse /sys/devices/system/node/node*/cpumap. On NUMA systems with less than 32 possibly CPUs, the current compat_sys_sched_getaffinity now returns '4' instead of the actual NR_CPUS/8, which makes libnuma bail out when parsing the cpumap. The libnuma call sched_getaffinity(0, bitmap, 4096) at first. It mean the libnuma expect the return value of sched_getaffinity() is either len argument or NR_CPUS. But it doesn't expect to return nr_cpu_ids. Strictly speaking, userland requirement are 1) Glibc assume the return value mean the lengh of initialized of mask argument. E.g. if sched_getaffinity(1024) return 128, glibc make zero fill rest 896 byte. 2) Libnuma assume the return value can be used to guess NR_CPUS in kernel. It assume len-arg<NR_CPUS makes -EINVAL. But it try len=4096 at first and 4096 is always bigger than NR_CPUS. Then, if we remove strange min_length normalization, we never hit -EINVAL case. sched_getaffinity() already solved this issue. This patch adapts compat_sys_sched_getaffinity() to match the non-compat case. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Ken Werner <ken.werner@web.de> Cc: stable@kernel.org Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-19 11:48:18 -07:00
Linus Torvalds	ba0234ec35	Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6 * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (24 commits) [S390] drivers/s390/char: Use kmemdup [S390] drivers/s390/char: Use kstrdup [S390] debug: enable exception-trace debug facility [S390] s390_hypfs: Add new attributes [S390] qdio: remove API wrappers [S390] qdio: set correct bit in dsci [S390] qdio: dont convert timestamps to microseconds [S390] qdio: remove memset hack [S390] qdio: prevent starvation on PCI devices [S390] qdio: count number of qdio interrupts [S390] user space fault: report fault before calling do_exit [S390] topology: expose core identifier [S390] dasd: remove uid from devmap [S390] dasd: add dynamic pav toleration [S390] vdso: add missing vdso_install target [S390] vdso: remove redundant check for CONFIG_64BIT [S390] avoid default_llseek in s390 drivers [S390] vmcp: disallow modular build [S390] add breaking event address for user space [S390] virtualization aware cpu measurement ...	2010-05-19 11:35:30 -07:00
Dmitry Torokhov	8d0bc2b456	Merge commit 'v2.6.34' into next	2010-05-19 10:12:41 -07:00
Don Zickus	26e09c6eee	lockup_detector: Convert per_cpu to __get_cpu_var for readability Just a bunch of conversions as suggested by Frederic W. __get_cpu_var() provides preemption disabled checks. Plus it gives more readability as it makes it obvious we are dealing locally now with these vars. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Don Zickus <dzickus@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> LKML-Reference: <1274133966-18415-2-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-19 11:32:14 +02:00
Rusty Russell	480b02df3a	module: drop the lock while waiting for module to complete initialization. This fixes "gave up waiting for init of module libcrc32c." which happened at boot time due to multiple parallel module loads. The problem was a deadlock: we wait for a module to finish initializing, but we keep the module_lock mutex so it can't complete. In particular, this could reasonably happen if a module does a request_module() in its initialization routine. So we change use_module() to return an errno rather than a bool, and if it's -EBUSY we drop the lock and wait in the caller, then reaquire the lock. Reported-by: Brandon Philips <brandon@ifup.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Tested-by: Brandon Philips <brandon@ifup.org>	2010-05-19 17:33:39 +09:30
Ben Hutchings	92946bc72f	panic: Add taint flag TAINT_FIRMWARE_WORKAROUND ('I') This taint flag will initially be used when warning about invalid ACPI DMAR tables. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2010-05-19 08:37:43 +01:00
Ben Hutchings	b2be05273a	panic: Allow warnings to set different taint flags WARN() is used in some places to report firmware or hardware bugs that are then worked-around. These bugs do not affect the stability of the kernel and should not set the flag for TAINT_WARN. To allow for this, add WARN_TAINT() and WARN_TAINT_ONCE() macros that take a taint number as argument. Architectures that implement warnings using trap instructions instead of calls to warn_slowpath_*() now implement __WARN_TAINT(taint) instead of __WARN(). Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Acked-by: Helge Deller <deller@gmx.de> Tested-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2010-05-19 08:36:48 +01:00
Tony Breeds	fd6be105b8	mutex: Fix optimistic spinning vs. BKL Currently, we can hit a nasty case with optimistic spinning on mutexes: CPU A tries to take a mutex, while holding the BKL CPU B tried to take the BLK while holding the mutex This looks like a AB-BA scenario but in practice, is allowed and happens due to the auto-release on schedule() nature of the BKL. In that case, the optimistic spinning code can get us into a situation where instead of going to sleep, A will spin waiting for B who is spinning waiting for A, and the only way out of that loop is the need_resched() test in mutex_spin_on_owner(). This patch fixes it by completely disabling spinning if we own the BKL. This adds one more detail to the extensive list of reasons why it's a bad idea for kernel code to be holding the BKL. Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: <stable@kernel.org> LKML-Reference: <20100519054636.GC12389@ozlabs.org> [ added an unlikely() attribute to the branch ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-19 08:18:44 +02:00
David S. Miller	2ec8c6bb5d	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ Conflicts: include/linux/mod_devicetable.h scripts/mod/file2alias.c	2010-05-18 23:01:55 -07:00
Steffen Klassert	3789ae7dcd	padata: Use get_online_cpus/put_online_cpus in padata_free Add get_online_cpus/put_online_cpus to ensure that no cpu goes offline during the flushing of the padata percpu queues. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-19 13:45:35 +10:00
Steffen Klassert	0198ffd135	padata: Add some code comments Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-19 13:44:27 +10:00
Steffen Klassert	2b73b07ab8	padata: Flush the padata queues actively yield was used to wait until all references of the internal control structure in use are dropped before it is freed. This patch implements padata_flush_queues which actively flushes the padata percpu queues in this case. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-19 13:43:46 +10:00
Steffen Klassert	d46a5ac7a7	padata: Use a timer to handle remaining objects in the reorder queues padata_get_next needs to check whether the next object that need serialization must be parallel processed by the local cpu. This check was wrong implemented and returned always true, so the try_again loop in padata_reorder was never taken. This can lead to object leaks in some rare cases due to a race that appears with the trylock in padata_reorder. The try_again loop was not a good idea after all, because a cpu could take that loop frequently, so we handle this with a timer instead. This patch adds a timer to handle the race that appears with the trylock. If cpu1 queues an object to the reorder queue while cpu2 holds the pd->lock but left the while loop in padata_reorder already, cpu2 can't care for this object and cpu1 exits because it can't get the lock. Usually the next cpu that takes the lock cares for this object too. We need the timer just if this object was the last one that arrives to the reorder queues. The timer function sends it out in this case. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-19 13:43:14 +10:00
Peter Zijlstra	6d1acfd5c6	perf: Optimize perf_output_*() by avoiding local_xchg() Since the x86 XCHG ins implies LOCK, avoid the use by using a sequence count instead. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:49 +02:00
Peter Zijlstra	fa5881514e	perf: Optimize the hotpath by converting the perf output buffer to local_t Since there is now only a single writer, we can use local_t instead and avoid all these pesky LOCK insn. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:49 +02:00
Peter Zijlstra	ef60777c9a	perf: Optimize the perf_output() path by removing IRQ-disables Since we can now assume there is only a single writer to each buffer, we can remove per-cpu lock thingy and use a simply nest-count to the same effect. This removes the need to disable IRQs. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:48 +02:00
Peter Zijlstra	c7920614ce	perf: Disallow mmap() on per-task inherited events Since we now have working per-task-per-cpu events for a while, disallow mmap() on per-task inherited events. Those things were a performance problem anyway, and doing away with it allows us to optimize the buffer somewhat by assuming there is only a single writer. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:48 +02:00
Peter Zijlstra	a19d35c11f	perf: Optimize buffer placement by allocating buffers NUMA aware Ensure cpu bound buffers live on the right NUMA node. Suggested-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1274114880.5605.5236.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:47 +02:00
Stephane Eranian	00d1d0b095	perf: Fix errors path in perf_output_begin() In case the sampling buffer has no "payload" pages, nr_pages is 0. The problem is that the error path in perf_output_begin() skips to a label which assumes perf_output_lock() has been issued which is not the case. That triggers a WARN_ON() in perf_output_unlock(). This patch fixes the problem by skipping perf_output_unlock() in case data->nr_pages is 0. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4bf13674.014fd80a.6c82.ffffb20c@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:47 +02:00
Peter Zijlstra	4f41c013f5	perf/ftrace: Optimize perf/tracepoint interaction for single events When we've got but a single event per tracepoint there is no reason to try and multiplex it so don't. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:46 +02:00
Linus Torvalds	752f114fb8	Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix "integer as NULL pointer" warning. tracing: Fix tracepoint.h DECLARE_TRACE() to allow more than one header tracing: Make the documentation clear on trace_event boot option ring-buffer: Wrap open-coded WARN_ONCE tracing: Convert nop macros to static inlines tracing: Fix sleep time function profiling tracing: Show sample std dev in function profiling tracing: Add documentation for trace commands mod, traceon/traceoff ring-buffer: Make benchmark handle missed events ring-buffer: Make non-consuming read less expensive with lots of cpus. tracing: Add graph output support for irqsoff tracer tracing: Have graph flags passed in to ouput functions tracing: Add ftrace events for graph tracer tracing: Dump either the oops's cpu source or all cpus buffers tracing: Fix uninitialized variable of tracing/trace output	2010-05-18 08:35:04 -07:00
Linus Torvalds	b8ae30ee26	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits) stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback() sched, wait: Use wrapper functions sched: Remove a stale comment ondemand: Make the iowait-is-busy time a sysfs tunable ondemand: Solve a big performance issue by counting IOWAIT time as busy sched: Intoduce get_cpu_iowait_time_us() sched: Eliminate the ts->idle_lastupdate field sched: Fold updating of the last_update_time_info into update_ts_time_stats() sched: Update the idle statistics in get_cpu_idle_time_us() sched: Introduce a function to update the idle statistics sched: Add a comment to get_cpu_idle_time_us() cpu_stop: add dummy implementation for UP sched: Remove rq argument to the tracepoints rcu: need barrier() in UP synchronize_sched_expedited() sched: correctly place paranioa memory barriers in synchronize_sched_expedited() sched: kill paranoia check in synchronize_sched_expedited() sched: replace migration_thread with cpu_stop stop_machine: reimplement using cpu_stop cpu_stop: implement stop_cpu[s]() sched: Fix select_idle_sibling() logic in select_task_rq_fair() ...	2010-05-18 08:27:54 -07:00
Linus Torvalds	4d7b4ac22f	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (311 commits) perf tools: Add mode to build without newt support perf symbols: symbol inconsistency message should be done only at verbose=1 perf tui: Add explicit -lslang option perf options: Type check all the remaining OPT_ variants perf options: Type check OPT_BOOLEAN and fix the offenders perf options: Check v type in OPT_U?INTEGER perf options: Introduce OPT_UINTEGER perf tui: Add workaround for slang < 2.1.4 perf record: Fix bug mismatch with -c option definition perf options: Introduce OPT_U64 perf tui: Add help window to show key associations perf tui: Make <- exit menus too perf newt: Add single key shortcuts for zoom into DSO and threads perf newt: Exit browser unconditionally when CTRL+C, q or Q is pressed perf newt: Fix the 'A'/'a' shortcut for annotate perf newt: Make <- exit the ui_browser x86, perf: P4 PMU - fix counters management logic perf newt: Make <- zoom out filters perf report: Report number of events, not samples perf hist: Clarify events_stats fields usage ... Fix up trivial conflicts in kernel/fork.c and tools/perf/builtin-record.c	2010-05-18 08:19:03 -07:00
Linus Torvalds	3aaf51ace5	Merge branch 'oprofile-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'oprofile-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits) oprofile/x86: make AMD IBS hotplug capable oprofile/x86: notify cpus only when daemon is running oprofile/x86: reordering some functions oprofile/x86: stop disabled counters in nmi handler oprofile/x86: protect cpu hotplug sections oprofile/x86: remove CONFIG_SMP macros oprofile/x86: fix uninitialized counter usage during cpu hotplug oprofile/x86: remove duplicate IBS capability check oprofile/x86: move IBS code oprofile/x86: return -EBUSY if counters are already reserved oprofile/x86: moving shutdown functions oprofile/x86: reserve counter msrs pairwise oprofile/x86: rework error handler in nmi_setup() oprofile: update file list in MAINTAINERS file oprofile: protect from not being in an IRQ context oprofile: remove double ring buffering ring-buffer: Add lost event count to end of sub buffer tracing: Show the lost events in the trace_pipe output ring-buffer: Add place holder recording of dropped events tracing: Fix compile error in module tracepoints when MODULE_UNLOAD not set ...	2010-05-18 08:18:07 -07:00
Linus Torvalds	f262af3d08	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits) rcu: remove all rcu head initializations, except on_stack initializations rcu head introduce rcu head init on stack Debugobjects transition check rcu: fix build bug in RCU_FAST_NO_HZ builds rcu: RCU_FAST_NO_HZ must check RCU dyntick state rcu: make SRCU usable in modules rcu: improve the RCU CPU-stall warning documentation rcu: reduce the number of spurious RCU_SOFTIRQ invocations rcu: permit discontiguous cpu_possible_mask CPU numbering rcu: improve RCU CPU stall-warning messages rcu: print boot-time console messages if RCU configs out of ordinary rcu: disable CPU stall warnings upon panic rcu: enable CPU_STALL_VERBOSE by default rcu: slim down rcutiny by removing rcu_scheduler_active and friends rcu: refactor RCU's context-switch handling rcu: rename rcutiny rcu_ctrlblk to rcu_sched_ctrlblk rcu: shrink rcutiny by making synchronize_rcu_bh() be inline rcu: fix now-bogus rcu_scheduler_active comments. rcu: Fix bogus CONFIG_PROVE_LOCKING in comments to reflect reality. rcu: ignore offline CPUs in last non-dyntick-idle CPU check ...	2010-05-18 08:17:58 -07:00
Linus Torvalds	1014cfe2fb	Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: Reduce stack_trace usage lockdep: No need to disable preemption in debug atomic ops lockdep: Actually _dec_ in debug_atomic_dec lockdep: Provide off case for redundant_hardirqs_on increment lockdep: Simplify debug atomic ops lockdep: Fix redundant_hardirqs_on incremented with irqs enabled lockstat: Make lockstat counting per cpu i8253: Convert i8253_lock to raw_spinlock	2010-05-18 08:17:35 -07:00
James Bottomley	95bb335c0e	[SCSI] Merge scsi-misc-2.6 into scsi-rc-fixes-2.6 Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2010-05-18 10:37:41 -04:00
Steven Rostedt	f0218b3e99	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into trace/tip/tracing/core-6 Conflicts: include/trace/ftrace.h kernel/trace/trace_kprobe.c Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-18 00:35:23 -04:00
James Morris	539c99fd7f	Merge branch 'next' into for-linus	2010-05-18 08:57:00 +10:00
Ingo Molnar	9c6f7e43b4	stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback() This addresses the following compiler warning: kernel/stop_machine.c: In function 'cpu_stop_cpu_callback': kernel/stop_machine.c:297: warning: unused variable 'work' Cc: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <tip-3fc1f1e27a5b807791d72e5d992aa33b668a6626@git.kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 00:17:44 +02:00
Linus Torvalds	3d2c978e0c	Merge branch 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing * 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: ptrace: Cleanup useless header ptrace: kill BKL in ptrace syscall	2010-05-17 13:48:10 -07:00
Heiko Carstens	ab3c68ee5f	[S390] debug: enable exception-trace debug facility The exception-trace facility on x86 and other architectures prints traces to dmesg whenever a user space application crashes. s390 has such a feature since ages however it is called userprocess_debug and is enabled differently. This patch makes sure that whenever one of the two procfs files /proc/sys/kernel/userprocess_debug /proc/sys/debug/exception-trace is modified the contents of the second one changes as well. That way we keep backwards compatibilty but also support the same interface like other architectures do. Besides that the output of the traces is improved since it will now also contain the corresponding filename of the vma (when available) where the process caused a fault or trap. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	2010-05-17 10:00:17 +02:00
Mark Gross	25f3a5a285	PM: PM QOS update fix This update handles a use case where pm_qos update requests need to silently fail if the update is being sent to a handle that is NULL. The problem was that the original pm_qos silently fails when a request update is passed to a parameter that has not been added to the list yet. This update restores that behavior. Signed-off-by: markgross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-17 00:21:03 +02:00
Octavian Purdila	9f977fb7ae	sysctl: add proc_do_large_bitmap The new function can be used to read/write large bitmaps via /proc. A comma separated range format is used for compact output and input (e.g. 1,3-4,10-10). Writing into the file will first reset the bitmap then update it based on the given input. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-15 23:28:39 -07:00
Amerigo Wang	00b7c3395a	sysctl: refactor integer handling proc code (Based on Octavian's work, and I modified a lot.) As we are about to add another integer handling proc function a little bit of cleanup is in order: add a few helper functions to improve code readability and decrease code duplication. In the process a bug is also fixed: if the user specifies a number with more then 20 digits it will be interpreted as two integers (e.g. 10000...13 will be interpreted as 100.... and 13). Behavior for EFAULT handling was changed as well. Previous to this patch, when an EFAULT error occurred in the middle of a write operation, although some of the elements were set, that was not acknowledged to the user (by shorting the write and returning the number of bytes accepted). EFAULT is now treated just like any other errors by acknowledging the amount of bytes accepted. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-15 23:28:38 -07:00
Don Zickus	cafcd80d21	lockup_detector: Cross arch compile fixes Combining the softlockup and hardlockup code causes watchdog.c to build even without the hardlockup detection support. So if an arch, that has the previous and the new nmi watchdog implementations cohabiting, wants to know if the generic one is in use, CONFIG_LOCKUP_DETECTOR is not a reliable check. We need to use CONFIG_HARDLOCKUP_DETECTOR instead. Fixes: kernel/built-in.o: In function `touch_nmi_watchdog': (.text+0x449bc): multiple definition of `touch_nmi_watchdog' arch/sparc/kernel/built-in.o:(.text+0x11b28): first defined here Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Don Zickus <dzickus@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> LKML-Reference: <20100514151121.GR15159@redhat.com> [ use CONFIG_HARDLOCKUP_DETECTOR instead of CONFIG_PERF_EVENTS_NMI] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-16 04:25:14 +02:00
Frederic Weisbecker	23637d477c	lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR This new config is deemed to simplify even more the lockup detector dependencies and can make it easier to bring a smooth sorting between archs that support the new generic lockup detector and those that still have their own, especially for those that are in the middle of this migration. Instead of checking whether we have CONFIG_LOCKUP_DETECTOR + CONFIG_PERF_EVENTS_NMI each time an arch wants to know if it needs to build its own lockup detector, take a shortcut with this new config. It is enabled only if the hardlockup detection part of the whole lockup detector is on. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Don Zickus <dzickus@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com>	2010-05-16 01:57:42 +02:00
Hugh Dickins	16a2164bb0	profile: fix stats and data leakage If the kernel is large or the profiling step small, /proc/profile leaks data and readprofile shows silly stats, until readprofile -r has reset the buffer: clear the prof_buffer when it is vmalloc()ed. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-14 19:45:06 -07:00
Li Zefan	e1f7992e01	tracing: Fix function declarations if !CONFIG_STACKTRACE ftrace_trace_stack() and frace_trace_userstacke() take a struct ring_buffer argument, not struct trace_array. Commit e77405ad("tracing: pass around ring buffer instead of tracer") made this change. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BE77C14.5010806@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:33:24 -04:00
Steven Rostedt	553552ce17	tracing: Combine event filter_active and enable into single flags field The filter_active and enable both use an int (4 bytes each) to set a single flag. We can save 4 bytes per event by combining the two into a single integer. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4894944 1018052 861512 6774508 675eec vmlinux.id 4894871 1012292 861512 6768675 674823 vmlinux.flags This gives us another 5K in savings. The modification of both the enable and filter fields are done under the event_mutex, so it is still safe to combine the two. Note: Although Mathieu gave his Acked-by, he would like it documented that the reads of flags are not protected by the mutex. The way the code works, these reads will not break anything, but will have a residual effect. Since this behavior is the same even before this patch, describing this situation is left to another patch, as this patch does not change the behavior, but just brought it to Mathieu's attention. v2: Updated the event trace self test to for this change. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:33:22 -04:00
Steven Rostedt	32c0edaeaa	tracing: Remove duplicate id information in event structure Now that the trace_event structure is embedded in the ftrace_event_call structure, there is no need for the ftrace_event_call id field. The id field is the same as the trace_event type field. Removing the id and re-arranging the structure brings down the tracepoint footprint by another 5K. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig `4895024` `1023812` 861512 6780348 6775bc vmlinux.print 4894944 1018052 861512 6774508 675eec vmlinux.id Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:33:15 -04:00
Steven Rostedt	80decc70af	tracing: Move print functions into event class Currently, every event has its own trace_event structure. This is fine since the structure is needed anyway. But the print function structure (trace_event_functions) is now separate. Since the output of the trace event is done by the class (with the exception of events defined by DEFINE_EVENT_PRINT), it makes sense to have the class define the print functions that all events in the class can use. This makes a bigger deal with the syscall events since all syscall events use the same class. The savings here is another 30K. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900382 1048964 861512 6810858 67ecea vmlinux.init 4900446 1049028 861512 6810986 67ed6a vmlinux.preprint `4895024` `1023812` 861512 6780348 6775bc vmlinux.print To accomplish this, and to let the class know what event is being printed, the event structure is embedded in the ftrace_event_call structure. This should not be an issues since the event structure was created for each event anyway. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:20:34 -04:00
Steven Rostedt	a9a5776380	tracing: Allow events to share their print functions Multiple events may use the same method to print their data. Instead of having all events have a pointer to their print funtions, the trace_event structure now points to a trace_event_functions structure that will hold the way to print ouf the event. The event itself is now passed to the print function to let the print function know what kind of event it should print. This opens the door to consolidating the way several events print their output. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900382 1048964 861512 6810858 67ecea vmlinux.init 4900446 1049028 861512 6810986 67ed6a vmlinux.preprint This change slightly increases the size but is needed for the next change. v3: Fix the branch tracer events to handle this change. v2: Fix the new function graph tracer event calls to handle this change. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:20:32 -04:00
Steven Rostedt	0405ab80aa	tracing: Move raw_init from events to class The raw_init function pointer in the event is used to initialize various kinds of events. The type of initialization needed is usually classed to the kind of event it is. Two events with the same class will always have the same initialization function, so it makes sense to move this to the class structure. Perhaps even making a special system structure would work since the initialization is the same for all events within a system. But since there's no system structure (yet), this will just move it to the class. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900375 1053380 861512 6815267 67fe23 vmlinux.fields 4900382 1048964 861512 6810858 67ecea vmlinux.init The text grew very slightly, but this is a constant growth that happened with the changing of the C files that call the init code. The bigger savings is the data which will be saved the more events share a class. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:20:30 -04:00
Steven Rostedt	2e33af0295	tracing: Move fields from event to class structure Move the defined fields from the event to the class structure. Since the fields of the event are defined by the class they belong to, it makes sense to have the class hold the information instead of the individual events. The events of the same class would just hold duplicate information. After this change the size of the kernel dropped another 3K: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900252 1057412 861512 6819176 680d68 vmlinux.regs 4900375 1053380 861512 6815267 67fe23 vmlinux.fields Although the text increased, this was mainly due to the C files having to adapt to the change. This is a constant increase, where new tracepoints will not increase the Text. But the big drop is in the data size (as well as needed allocations to hold the fields). This will give even more savings as more tracepoints are created. Note, if just TRACE_EVENT()s are used and not DECLARE_EVENT_CLASS() with several DEFINE_EVENT()s, then the savings will be lost. But we are pushing developers to consolidate events with DEFINE_EVENT() so this should not be an issue. The kprobes define a unique class to every new event, but are dynamic so it should not be a issue. The syscalls however have a single class but the fields for the individual events are different. The syscalls use a metadata to define the fields. I moved the fields list from the event to the metadata and added a "get_fields()" function to the class. This function is used to find the fields. For normal events and kprobes, get_fields() just returns a pointer to the fields list_head in the class. For syscall events, it returns the fields list_head in the metadata for the event. v2: Fixed the syscall fields. The syscall metadata needs a list of fields for both enter and exit. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:20:23 -04:00
Steven Rostedt	2239291aeb	tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:19:14 -04:00
Steven Rostedt	38516ab59f	tracing: Let tracepoints have data passed to tracepoint callbacks This patch adds data to be passed to tracepoint callbacks. The created functions from DECLARE_TRACE() now need a mandatory data parameter. For example: DECLARE_TRACE(mytracepoint, int value, value) Will create the register function: int register_trace_mytracepoint((void()(void data, int value))probe, void data); As the first argument, all callbacks (probes) must take a (void data) parameter. So a callback for the above tracepoint will look like: void myprobe(void data, int value) { } The callback may choose to ignore the data parameter. This change allows callbacks to register a private data pointer along with the function probe. void mycallback(void data, int value); register_trace_mytracepoint(mycallback, mydata); Then the mycallback() will receive the "mydata" as the first parameter before the args. A more detailed example: DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); /* In the C file / DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); [...] trace_mytracepoint(status); / In a file registering this tracepoint / int my_callback(void data, int status) { struct my_struct my_data = data; [...] } [...] my_data = kmalloc(sizeof(my_data), GFP_KERNEL); init_my_data(my_data); register_trace_mytracepoint(my_callback, my_data); The same callback can also be registered to the same tracepoint as long as the data registered is different. Note, the data must also be used to unregister the callback: unregister_trace_mytracepoint(my_callback, my_data); Because of the data parameter, tracepoints declared this way can not have no args. That is: DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS()); will cause an error. If no arguments are needed, a new macro can be used instead: DECLARE_TRACE_NOARGS(mytracepoint); Since there are no arguments, the proto and args fields are left out. This is part of a series to make the tracepoint footprint smaller: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint Again, this patch also increases the size of the kernel, but lays the ground work for decreasing it. v5: Fixed net/core/drop_monitor.c to handle these updates. v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the #ifdef CONFIG_TRACE_POINTS, since the two are the same in both cases. The __DECLARE_TRACE() is what changes. Thanks to Frederic Weisbecker for pointing this out. v3: Made all register_ functions require data to be passed and all callbacks to take a void * parameter as its first argument. This makes the calling functions comply with C standards. Also added more comments to the modifications of DECLARE_TRACE(). v2: Made the DECLARE_TRACE() have the ability to pass arguments and added a new DECLARE_TRACE_NOARGS() for tracepoints that do not need any arguments. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 09:50:34 -04:00
Steven Rostedt	8f08201830	tracing: Create class struct for events This patch creates a ftrace_event_class struct that event structs point to. This class struct will be made to hold information to modify the events. Currently the class struct only holds the events system name. This patch slightly increases the size, but this change lays the ground work of other changes to make the footprint of tracepoints smaller. With 82 standard tracepoints, and 618 system call tracepoints (two tracepoints per syscall: enter and exit): text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class This patch also cleans up some stale comments in ftrace.h. v2: Fixed missing semi-colon in macro. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 09:33:49 -04:00
Steven Rostedt	23e117fa44	Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into trace/tip/tracing/core-4	2010-05-14 09:29:52 -04:00
Ingo Molnar	0167c78190	watchdog: Export touch_softlockup_watchdog There are modules that rely on it: ERROR: "touch_softlockup_watchdog" [drivers/video/nvidia/nvidiafb.ko] undefined! Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> LKML-Reference: <1273713674-8434-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-13 08:53:33 +02:00
Ingo Molnar	ad56b0797e	Merge commit 'v2.6.34-rc7' into tracing/core Merge reason: Update from -rc5 to -rc7. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-13 08:11:29 +02:00
Don Zickus	d7c547335f	lockup_detector: Separate touch_nmi_watchdog code path from touch_watchdog When I combined the nmi_watchdog (hardlockup) and softlockup code, I also combined the paths the touch_watchdog and touch_nmi_watchdog took. This may not be the best idea as pointed out by Frederic W., that the touch_watchdog case probably should not reset the hardlockup count. Therefore the patch below falls back to the previous idea of keeping the touch_nmi_watchdog a superset of the touch_watchdog case. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-9-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-12 23:55:55 +02:00
Don Zickus	f69bcf60c3	lockup_detector: Remove nmi_watchdog.c file This file migrated to kernel/watchdog.c and then combined with kernel/softlockup.c. As a result kernel/nmi_watchdog.c is no longer needed. Just remove it. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-5-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-12 23:55:46 +02:00
Don Zickus	2508ce1845	lockup_detector: Remove old softlockup code Now that is no longer compiled or used, just remove it. Also move some of the code wrapped with DETECT_SOFTLOCKUP to the LOCKUP_DETECTOR wrappers because that is the code that uses it now. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-4-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-12 23:55:45 +02:00
Don Zickus	332fbdbca3	lockup_detector: Touch_softlockup cleanups and softlockup_tick removal Just some code cleanup to make touch_softlockup clearer and remove the softlockup_tick function as it is no longer needed. Also remove the /proc softlockup_thres call as it has been changed to watchdog_thres. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-3-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-12 23:55:43 +02:00
Don Zickus	58687acba5	lockup_detector: Combine nmi_watchdog and softlockup detector The new nmi_watchdog (which uses the perf event subsystem) is very similar in structure to the softlockup detector. Using Ingo's suggestion, I combined the two functionalities into one file: kernel/watchdog.c. Now both the nmi_watchdog (or hardlockup detector) and softlockup detector sit on top of the perf event subsystem, which is run every 60 seconds or so to see if there are any lockups. To detect hardlockups, cpus not responding to interrupts, I implemented an hrtimer that runs 5 times for every perf event overflow event. If that stops counting on a cpu, then the cpu is most likely in trouble. To detect softlockups, tasks not yielding to the scheduler, I used the previous kthread idea that now gets kicked every time the hrtimer fires. If the kthread isn't being scheduled neither is anyone else and the warning is printed to the console. I tested this on x86_64 and both the softlockup and hardlockup paths work. V2: - cleaned up the Kconfig and softlockup combination - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI - seperated out the softlockup case from perf event subsystem - re-arranged the enabling/disabling nmi watchdog from proc space - added cpumasks for hardlockup failure cases - removed fallback to soft events if no PMU exists for hard events V3: - comment cleanups - drop support for older softlockup code - per_cpu cleanups - completely remove software clock base hardlockup detector - use per_cpu masking on hard/soft lockup detection - #ifdef cleanups - rename config option NMI_WATCHDOG to LOCKUP_DETECTOR - documentation additions V4: - documentation fixes - convert per_cpu to __get_cpu_var - powerpc compile fixes V5: - split apart warn flags for hard and soft lockups TODO: - figure out how to make an arch-agnostic clock2cycles call (if possible) to feed into perf events as a sample period [fweisbec: merged conflict patch] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-2-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-12 23:55:33 +02:00
Frederic Weisbecker	a9aa1d02de	Merge commit 'v2.6.34-rc7' into perf/nmi Merge reason: catch up with latest softlockup detector changes.	2010-05-12 23:20:33 +02:00
Peter P Waskiewicz Jr	4308ad8011	genirq: Clear CPU mask in affinity_hint when none is provided When an interrupt is disabled and torn down, the CPU mask returned through affinity_hint right now is all CPUs. Also, for drivers that don't provide an affinity_hint mask, this can be misleading. There should be no hint at all, meaning an empty CPU mask. [ tglx: use zalloc_cpumask_var instead of clearing it under the lock ] Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Cc: davem@davemloft.net Cc: arjan@linux.jf.intel.com Cc: bhutchings@solarflare.com LKML-Reference: <20100505205638.5426.87189.stgit@ppwaskie-hc2.jf.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-12 11:23:34 +02:00
David S. Miller	278554bd65	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: Documentation/feature-removal-schedule.txt drivers/net/wireless/ath/ar9170/usb.c drivers/scsi/iscsi_tcp.c net/ipv4/ipmr.c	2010-05-12 00:05:35 -07:00
KAMEZAWA Hiroyuki	747388d78a	memcg: fix css_is_ancestor() RCU locking Some callers (in memcontrol.c) calls css_is_ancestor() without rcu_read_lock. Because css_is_ancestor() has to access RCU protected data, it should be under rcu_read_lock(). This makes css_is_ancestor() itself does safe access to RCU protected area. (At least, "root" can have refcnt==0 if it's not an ancestor of "child". So, we need rcu_read_lock().) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-11 17:33:42 -07:00
KAMEZAWA Hiroyuki	7f0f154641	memcg: fix css_id() RCU locking for real Commit `ad4ba37537` ("memcg: css_id() must be called under rcu_read_lock()") modifies memcontol.c for fixing RCU check message. But Andrew Morton pointed out that the fix doesn't seems sane and it was just for hidining lockdep messages. This is a patch for do proper things. Checking again, all places, accessing without rcu_read_lock, that commit fixies was intentional.... all callers of css_id() has reference count on it. So, it's not necessary to be under rcu_read_lock(). Considering again, we can use rcu_dereference_check for css_id(). We know css->id is valid if css->refcnt > 0. (css->id never changes and freed after css->refcnt going to be 0.) This patch makes use of rcu_dereference_check() in css_id/depth and remove unnecessary rcu-read-lock added by the commit. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-11 17:33:42 -07:00
Vitaliy Gusev	11cad320a4	bsdacct: use del_timer_sync() in acct_exit_ns() acct_exit_ns --> acct_file_reopen deletes timer without check timer execution on other CPUs. So acct_timeout() can change an unmapped memory. Signed-off-by: Vitaliy Gusev <vgusev@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-11 17:33:42 -07:00
Vitaly Mayatskikh	475f9aa6aa	kexec: fix OOPS in crash_kernel_shrink Two "echo 0 > /sys/kernel/kexec_crash_size" OOPSes kernel. Also content of this file is invalid after first shrink to zero: it shows 1 instead of 0. This scenario is unlikely to happen often (root privs, valid crashkernel= in cmdline, dump-capture kernel not loaded), I hit it only by chance. This patch fixes it. Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Cc: Cong Wang <amwang@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-11 17:33:42 -07:00
Robin Holt	34441427aa	revert "procfs: provide stack information for threads" and its fixup commits Originally, commit `d899bf7b` ("procfs: provide stack information for threads") attempted to introduce a new feature for showing where the threadstack was located and how many pages are being utilized by the stack. Commit `c44972f1` ("procfs: disable per-task stack usage on NOMMU") was applied to fix the NO_MMU case. Commit `89240ba0` ("x86, fs: Fix x86 procfs stack information for threads on 64-bit") was applied to fix a bug in ia32 executables being loaded. Commit `9ebd4eba7` ("procfs: fix /proc/<pid>/stat stack pointer for kernel threads") was applied to fix a bug which had kernel threads printing a userland stack address. Commit `1306d603f` ('proc: partially revert "procfs: provide stack information for threads"') was then applied to revert the stack pages being used to solve a significant performance regression. This patch nearly undoes the effect of all these patches. The reason for reverting these is it provides an unusable value in field 28. For x86_64, a fork will result in the task->stack_start value being updated to the current user top of stack and not the stack start address. This unpredictability of the stack_start value makes it worthless. That includes the intended use of showing how much stack space a thread has. Other architectures will get different values. As an example, ia64 gets 0. The do_fork() and copy_process() functions appear to treat the stack_start and stack_size parameters as architecture specific. I only partially reverted `c44972f1` ("procfs: disable per-task stack usage on NOMMU") . If I had completely reverted it, I would have had to change mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is configured. Since I could not test the builds without significant effort, I decided to not change mm/Makefile. I only partially reverted `89240ba0` ("x86, fs: Fix x86 procfs stack information for threads on 64-bit") . I left the KSTK_ESP() change in place as that seemed worthwhile. Signed-off-by: Robin Holt <holt@sgi.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-11 17:33:41 -07:00
Paul E. McKenney	72d5a9f7a9	rcu: remove all rcu head initializations, except on_stack initializations Remove all rcu head inits. We don't care about the RCU head state before passing it to call_rcu() anyway. Only leave the "on_stack" variants so debugobjects can keep track of objects on stack. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-11 16:10:47 -07:00
Alan Cox	8b6d043b7e	resource: shared I/O region support SuperIO devices share regions and use lock/unlock operations to chip select. We therefore need to be able to request a resource and wait for it to be freed by whichever other SuperIO device currently hogs it. Right now you have to poll which is horrible. Add a MUXED field to IO port resources. If the MUXED field is set on the resource and on the request (via request_muxed_region) then we block until the previous owner of the muxed resource releases their region. This allows us to implement proper resource sharing and locking for superio chips using code of the form enable_my_superio_dev() { request_muxed_region(0x44, 0x02, "superio:watchdog"); outb() ..sequence to enable chip } disable_my_superio_dev() { outb() .. sequence of disable chip release_region(0x44, 0x02); } Signed-off-by: Giel van Schijndel <me@mortis.eu> Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-05-11 12:01:10 -07:00
Changli Gao	a93d2f1744	sched, wait: Use wrapper functions epoll should not touch flags in wait_queue_t. This patch introduces a new function __add_wait_queue_exclusive(), for the users, who use wait queue as a LIFO queue. __add_wait_queue_tail_exclusive() is introduced too instead of add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed, as it is a duplicate of __remove_wait_queue(), disliked by users, and with less users. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: <containers@lists.linux-foundation.org> LKML-Reference: <1273214006-2979-1-git-send-email-xiaosuo@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-11 17:43:58 +02:00
Peter Zijlstra	96c21a460a	perf: Fix exit() vs event-groups Corey reported that the value scale times of group siblings are not updated when the monitored task dies. The problem appears to be that we only update the group leader's time values, fix it by updating the whole group. Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> # .34.x LKML-Reference: <1273588935.1810.6.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-11 17:08:24 +02:00
Peter Zijlstra	050735b08c	perf: Fix exit() vs PERF_FORMAT_GROUP Both Stephane and Corey reported that PERF_FORMAT_GROUP didn't work as expected if the task the counters were attached to quit before the read() call. The cause is that we unconditionally destroy the grouping when we remove counters from their context. Fix this by splitting off the group destroy from the list removal such that perf_event_remove_from_context() does not do this and change perf_event_release() to do so. Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com> Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: <stable@kernel.org> # .34.x LKML-Reference: <1273571513.5605.3527.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-11 15:46:43 +02:00
Ingo Molnar	e3174cfd2a	Revert "perf: Fix exit() vs PERF_FORMAT_GROUP" This reverts commit `4fd38e4595`. It causes various crashes and hangs when events are activated. The cause is not fully understood yet but we need to revert it because the effects are severe. Reported-by: Stephane Eranian <eranian@google.com> Reported-by: Lin Ming <ming.m.lin@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-11 08:31:49 +02:00
Matt Helsley	8f77578cc2	Freezer / cgroup freezer: Update stale locking comments Update stale comments regarding locking order and add a little more detail so it's easier to follow the locking between the cgroup freezer and the power management freezer code. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-10 23:18:47 +02:00
Mark Gross	ed77134bfc	PM QOS update This patch changes the string based list management to a handle base implementation to help with the hot path use of pm-qos, it also renames much of the API to use "request" as opposed to "requirement" that was used in the initial implementation. I did this because request more accurately represents what it actually does. Also, I added a string based ABI for users wanting to use a string interface. So if the user writes 0xDDDDDDDD formatted hex it will be accepted by the interface. (someone asked me for it and I don't think it hurts anything.) This patch updates some documentation input I got from Randy. Signed-off-by: markgross <mgross@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-10 23:08:19 +02:00
Randy Dunlap	0fef8b1e83	PM / Hibernate: Fix block_io.c printk warning Fix printk format warning in block_io.c: kernel/power/block_io.c:41: warning: format '%ld' expects type 'long int', but argument 2 has type 'sector_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-10 23:08:18 +02:00
Jiri Slaby	6f612af578	PM / Hibernate: Group swap ops Move all the swap processing into one function. It will make swap calls from a non-swap code easier. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-10 23:08:18 +02:00
Jiri Slaby	51fb352b2c	PM / Hibernate: Move the first_sector out of swsusp_write The first sector knowledge is swap-only specific. Move it into the swap handle. This will be needed for later non-swap specific code moving into snapshot.c. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2010-05-10 23:08:18 +02:00
Jiri Slaby	8a0d613fa1	PM / Hibernate: Separate block_io Move block I/O operations to a separate file. It is because it will be used later not only by the swap writer. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-10 23:08:18 +02:00
Jiri Slaby	d3c1b24c50	PM / Hibernate: Snapshot cleanup Remove support of reads with offset. This means snapshot_read/write_next now does not accept count parameter. It allows to clean up the functions and snapshot handle which no longer needs to care about offsets. /dev/snapshot handler is converted to simple_{read_from,write_to}_buffer which take care of offsets. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-10 23:08:17 +02:00
Paul E. McKenney	d822ed1094	rcu: fix build bug in RCU_FAST_NO_HZ builds Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:35 -07:00
Paul E. McKenney	77e38ed347	rcu: RCU_FAST_NO_HZ must check RCU dyntick state The current version of RCU_FAST_NO_HZ reproduces the old CLASSIC_RCU dyntick-idle bug, as it fails to detect CPUs that have interrupted or NMIed out of dyntick-idle mode. Fix this by making rcu_needs_cpu() check the state in the per-CPU rcu_dynticks variables, thus correctly detecting the dyntick-idle state from an RCU perspective. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:35 -07:00
Paul E. McKenney	d21670acab	rcu: reduce the number of spurious RCU_SOFTIRQ invocations Lai Jiangshan noted that up to 10% of the RCU_SOFTIRQ are spurious, and traced this down to the fact that the current grace-period machinery will uselessly raise RCU_SOFTIRQ when a given CPU needs to go through a quiescent state, but has not yet done so. In this situation, there might well be nothing that RCU_SOFTIRQ can do, and the overhead can be worth worrying about in the ksoftirqd case. This patch therefore avoids raising RCU_SOFTIRQ in this situation. Changes since v1 (http://lkml.org/lkml/2010/3/30/122 from Lai Jiangshan): o Omit the rcu_qs_pending() prechecks, as they aren't that much less expensive than the quiescent-state checks. o Merge with the set_need_resched() patch that reduces IPIs. o Add the new n_rp_report_qs field to the rcu_pending tracing output. o Update the tracing documentation accordingly. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:35 -07:00
Paul E. McKenney	4a90a0681c	rcu: permit discontiguous cpu_possible_mask CPU numbering TREE_RCU assumes that CPU numbering is contiguous, but some users need large holes in the numbering to better map to hardware layout. This patch makes TREE_RCU (and TREE_PREEMPT_RCU) tolerate large holes in the CPU numbering. However, NR_CPUS must still be greater than the largest CPU number. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:35 -07:00
Paul E. McKenney	4300aa642c	rcu: improve RCU CPU stall-warning messages The existing RCU CPU stall-warning messages can be confusing, especially in the case where one CPU detects a single other stalled CPU. In addition, the console messages did not say which flavor of RCU detected the stall, which can make it difficult to work out exactly what is causing the stall. This commit improves these messages. Requested-by: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:34 -07:00
Paul E. McKenney	26845c2860	rcu: print boot-time console messages if RCU configs out of ordinary Print boot-time messages if tracing is enabled, if fanout is set to non-default values, if exact fanout is specified, if accelerated dyntick-idle grace periods have been enabled, if RCU-lockdep is enabled, if rcutorture has been boot-time enabled, if the CPU stall detector has been disabled, or if four-level hierarchy has been enabled. This is all for TREE_RCU and TREE_PREEMPT_RCU. TINY_RCU will be handled separately, if at all. Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:34 -07:00
Paul E. McKenney	c68de2097a	rcu: disable CPU stall warnings upon panic The current RCU CPU stall warnings remain enabled even after a panic occurs, which some people have found to be a bit counterproductive. This patch therefore uses a notifier to disable stall warnings once a panic occurs. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:34 -07:00
Paul E. McKenney	bbad937983	rcu: slim down rcutiny by removing rcu_scheduler_active and friends TINY_RCU does not need rcu_scheduler_active unless CONFIG_DEBUG_LOCK_ALLOC. So conditionally compile rcu_scheduler_active in order to slim down rcutiny a bit more. Also gets rid of an EXPORT_SYMBOL_GPL, which is responsible for most of the slimming. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:34 -07:00
Paul E. McKenney	25502a6c13	rcu: refactor RCU's context-switch handling The addition of preemptible RCU to treercu resulted in a bit of confusion and inefficiency surrounding the handling of context switches for RCU-sched and for RCU-preempt. For RCU-sched, a context switch is a quiescent state, pure and simple, just like it always has been. For RCU-preempt, a context switch is in no way a quiescent state, but special handling is required when a task blocks in an RCU read-side critical section. However, the callout from the scheduler and the outer loop in ksoftirqd still calls something named rcu_sched_qs(), whose name is no longer accurate. Furthermore, when rcu_check_callbacks() notes an RCU-sched quiescent state, it ends up unnecessarily (though harmlessly, aside from the performance hit) enqueuing the current task if it happens to be running in an RCU-preempt read-side critical section. This not only increases the maximum latency of scheduler_tick(), it also needlessly increases the overhead of the next outermost rcu_read_unlock() invocation. This patch addresses this situation by separating the notion of RCU's context-switch handling from that of RCU-sched's quiescent states. The context-switch handling is covered by rcu_note_context_switch() in general and by rcu_preempt_note_context_switch() for preemptible RCU. This permits rcu_sched_qs() to handle quiescent states and only quiescent states. It also reduces the maximum latency of scheduler_tick(), though probably by much less than a microsecond. Finally, it means that tasks within preemptible-RCU read-side critical sections avoid incurring the overhead of queuing unless there really is a context switch. Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org>	2010-05-10 11:08:33 -07:00
Paul E. McKenney	99652b54de	rcu: rename rcutiny rcu_ctrlblk to rcu_sched_ctrlblk Make naming line up in preparation for CONFIG_TINY_PREEMPT_RCU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:32 -07:00
Paul E. McKenney	da848c47bc	rcu: shrink rcutiny by making synchronize_rcu_bh() be inline Because synchronize_rcu_bh() is identical to synchronize_sched(), make the former a static inline invoking the latter, saving the overhead of an EXPORT_SYMBOL_GPL() and the duplicate code. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:32 -07:00
Lai Jiangshan	5db356736a	rcu: ignore offline CPUs in last non-dyntick-idle CPU check Offline CPUs are not in nohz_cpu_mask, but can be ignored when checking for the last non-dyntick-idle CPU. This patch therefore only checks online CPUs for not being dyntick idle, allowing fast entry into full-system dyntick-idle state even when there are some offline CPUs. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:31 -07:00
Lai Jiangshan	0c34029abd	rcu: move some code from macro to function Shrink the RCU_INIT_FLAVOR() macro by moving all but the initialization of the ->rda[] array to rcu_init_one(). The call to rcu_init_one() can then be moved to the end of the RCU_INIT_FLAVOR() macro, which is required because rcu_boot_init_percpu_data(), which is now called from rcu_init_one(), depends on the initialization of the ->rda[] array. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:31 -07:00
Lai Jiangshan	f261414f0d	rcu: make dead code really dead cleanup: make dead code really dead Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:31 -07:00
Paul E. McKenney	d25eb9442b	rcu: substitute set_need_resched for sending resched IPIs This patch adds a check to __rcu_pending() that does a local set_need_resched() if the current CPU is holding up the current grace period and if force_quiescent_state() will be called soon. The goal is to reduce the probability that force_quiescent_state() will need to do smp_send_reschedule(), which sends an IPI and is therefore more expensive on most architectures. Signed-off-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:31 -07:00
Lai Jiangshan	2b3fc35f69	rcu: optionally leave lockdep enabled after RCU lockdep splat There is no need to disable lockdep after an RCU lockdep splat, so remove the debug_lockdeps_off() from lockdep_rcu_dereference(). To avoid repeated lockdep splats, use a static variable in the inlined rcu_dereference_check() and rcu_dereference_protected() macros so that a given instance splats only once, but so that multiple instances can be detected per boot. This is controlled by a new config variable CONFIG_PROVE_RCU_REPEATEDLY, which is disabled by default. This provides the normal lockdep behavior by default, but permits people who want to find multiple RCU-lockdep splats per boot to easily do so. Requested-by: Eric Paris <eparis@redhat.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Tested-by: Eric Paris <eparis@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:31 -07:00
John Stultz	d7e81c269d	clocksource: Add clocksource_register_hz/khz interface How to pick good mult/shift pairs has always been difficult to describe to folks writing clocksource drivers, since it requires careful tradeoffs in adjustment accuracy vs overflow limits. Now, with the clocks_calc_mult_shift function, its much easier. However, not many clocksources have converted to using that function, and there is still the issue of the max interval length assumption being made by each clocksource driver independently. So this patch simplifies the registration process by having clocksources be registered with a hz/khz value and the registration function taking care of setting mult/shift. This should take most of the confusion out of writing a clocksource driver. Additionally it also keeps the shift size tradeoff (more accuracy vs longer possible nohz times) centralized so the timekeeping core can keep track of the assumptions being made. [ tglx: Coding style and comments fixed ] Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1273280858-30143-1-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-10 14:24:26 +02:00
Stanislaw Gruszka	29f87b793d	posix-cpu-timers: Optimize run_posix_cpu_timers() We can optimize and simplify things taking into account signal->cputimer is always running when we have configured any process wide cpu timer. In check_process_timers(), we don't have to check if new updated value of signal->cputime_expires is smaller, since we maintain new first expiration time ({prof,virt,sched}_expires) in code flow and all other writes to expiration cache are protected by sighand->siglock . Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-10 14:24:26 +02:00
Thomas Gleixner	dbb6be6d5e	Merge branch 'linus' into timers/core Reason: Further posix_cpu_timer patches depend on mainline changes Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-10 14:20:42 +02:00
Ingo Molnar	7c224a03a7	Merge commit 'v2.6.34-rc7' into oprofile Merge reason: Update to Linus's latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-10 13:12:29 +02:00
Li Zefan	af507ae8a0	sched: Remove a stale comment This comment should have been removed together with uids_mutex when removing user sched. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dhaval Giani <dhaval.giani@gmail.com> LKML-Reference: <4BE77C6B.5010402@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-10 08:48:39 +02:00
Arjan van de Ven	0224cf4c5e	sched: Intoduce get_cpu_iowait_time_us() For the ondemand cpufreq governor, it is desired that the iowait time is microaccounted in a similar way as idle time is. This patch introduces the infrastructure to account and expose this information via the get_cpu_iowait_time_us() function. [akpm@linux-foundation.org: fix CONFIG_NO_HZ=n build] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082523.284feab6@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-09 19:35:27 +02:00
Arjan van de Ven	e0e37c200f	sched: Eliminate the ts->idle_lastupdate field Now that the only user of ts->idle_lastupdate is update_ts_time_stats(), the entire field can be eliminated. In update_ts_time_stats(), idle_lastupdate is first set to "now", and a few lines later, the only user is an if() statement that assigns a variable either to "now" or to ts->idle_lastupdate, which has the value of "now" at that point. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082439.2fab0b4f@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-09 19:35:26 +02:00
Arjan van de Ven	8d63bf949e	sched: Fold updating of the last_update_time_info into update_ts_time_stats() This patch folds the updating of the last_update_time into the update_ts_time_stats() function, and updates the callers. This allows for further cleanups that are done in the next patch. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082403.60072967@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-09 19:35:26 +02:00
Arjan van de Ven	8c7b09f43f	sched: Update the idle statistics in get_cpu_idle_time_us() Right now, get_cpu_idle_time_us() only reports the idle statistics upto the point the CPU entered last idle; not what is valid right now. This patch adds an update of the idle statistics to get_cpu_idle_time_us(), so that calling this function always returns statistics that are accurate at the point of the call. This includes resetting the start of the idle time for accounting purposes to avoid double accounting. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082323.2d2f1945@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-09 19:35:26 +02:00
Arjan van de Ven	595aac488b	sched: Introduce a function to update the idle statistics Currently, two places update the idle statistics (and more to come later in this series). This patch creates a helper function for updating these statistics. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082245.163e67ed@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-09 19:35:25 +02:00
Arjan van de Ven	b1f724c305	sched: Add a comment to get_cpu_idle_time_us() The exported function get_cpu_idle_time_us() has no comment describing it; add a kerneldoc comment Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082208.7cb721f0@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-09 19:35:25 +02:00
Frederic Weisbecker	9313543945	tracing: Drop the nested field from lock_release event Drop the nested field as we don't use it. Every nested state can be computed from a state machine on post processing already. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> Cc: Steven Rostedt <rostedt@goodmis.org>	2010-05-09 13:45:34 +02:00
Frederic Weisbecker	883a2a3189	tracing: Drop lock_acquired waittime field Drop the waittime field from the lock_acquired event, we can calculate it by substracting the lock_acquired event timestamp with the matching lock_acquire one. It is not needed and takes useless space in the traces. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> Cc: Steven Rostedt <rostedt@goodmis.org>	2010-05-09 13:45:32 +02:00
Ingo Molnar	e7858f52a5	Merge branch 'cpu_stop' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into sched/core	2010-05-08 18:11:19 +02:00
Masami Hiramatsu	c0614829c1	kprobes: Move enable/disable_kprobe() out from debugfs code Move enable/disable_kprobe() API out from debugfs related code, because these interfaces are not related to debugfs interface. This fixes a compiler warning. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: Tony Luck <tony.luck@intel.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> LKML-Reference: <20100427223312.2322.60512.stgit@localhost6.localdomain6> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-08 18:08:30 +02:00
Tejun Heo	bbf1bb3eee	cpu_stop: add dummy implementation for UP When !CONFIG_SMP, cpu_stop functions weren't defined at all which could lead to build failures if UP code uses cpu_stop facility. Add dummy cpu_stop implementation for UP. The waiting variants execute the work function directly with preempt disabled and stop_one_cpu_nowait() schedules a workqueue work. Makefile and ifdefs around stop_machine implementation are updated to accomodate CONFIG_SMP && !CONFIG_STOP_MACHINE case. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ingo Molnar <mingo@elte.hu>	2010-05-08 17:12:33 +02:00
Paul Mackerras	6e85158cf5	perf_event: Make software events work again Commit `6bde9b6ce0` ("perf: Add group scheduling transactional APIs") added code to allow a group to be scheduled in a single transaction. However, it introduced a bug in handling events whose pmu does not implement transactions -- at the end of scheduling in the events in the group, in the non-transactional case the code now falls through to the group_error label, and proceeds to unschedule all the events in the group and return failure. This fixes it by returning 0 (success) in the non-transactional case. Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: eranian@gmail.com LKML-Reference: <20100508105800.GB10650@brick.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-08 13:16:55 +02:00
Neil Horman	3ee943728f	ipv4: remove ip_rt_secret timer (v4) A while back there was a discussion regarding the rt_secret_interval timer. Given that we've had the ability to do emergency route cache rebuilds for awhile now, based on a statistical analysis of the various hash chain lengths in the cache, the use of the flush timer is somewhat redundant. This patch removes the rt_secret_interval sysctl, allowing us to rely solely on the statistical analysis mechanism to determine the need for route cache flushes. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-08 01:57:52 -07:00
Linus Torvalds	91bc482ec5	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: create rcu_my_thread_group_empty() wrapper memcg: css_id() must be called under rcu_read_lock() cgroup: Check task_lock in task_subsys_state() sched: Fix an RCU warning in print_task() cgroup: Fix an RCU warning in alloc_css_id() cgroup: Fix an RCU warning in cgroup_path() KEYS: Fix an RCU warning in the reading of user keys KEYS: Fix an RCU warning	2010-05-07 13:58:21 -07:00
Lin Ming	6bde9b6ce0	perf: Add group scheduling transactional APIs Add group scheduling transactional APIs to struct pmu. These APIs will be implemented in arch code, based on Peter's idea as below. > the idea behind hw_perf_group_sched_in() is to not perform > schedulability tests on each event in the group, but to add the group > as a whole and then perform one test. > > Of course, when that test fails, you'll have to roll-back the whole > group again. > > So start_txn (or a better name) would simply toggle a flag in the pmu > implementation that will make pmu::enable() not perform the > schedulablilty test. > > Then commit_txn() will perform the schedulability test (so note the > method has to have a !void return value. > > This will allow us to use the regular > kernel/perf_event.c::group_sched_in() and all the rollback code. > Currently each hw_perf_group_sched_in() implementation duplicates all > the rolllback code (with various bugs). ->start_txn: Start group events scheduling transaction, set a flag to make pmu::enable() not perform the schedulability test, it will be performed at commit time. ->commit_txn: Commit group events scheduling transaction, perform the group schedulability as a whole ->cancel_txn: Stop group events scheduling transaction, clear the flag so pmu::enable() will perform the schedulability test. Reviewed-by: Stephane Eranian <eranian@google.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Lin Ming <ming.m.lin@intel.com> Cc: David Miller <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1272002160.5707.60.camel@minggr.sh.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-07 11:31:02 +02:00
Peter Zijlstra	a0507c84bf	perf: Annotate perf_event_read_group() vs perf_event_release_kernel() Stephane reported a lockdep warning while using PERF_FORMAT_GROUP. The issue is that perf_event_read_group() takes faults while holding the ctx->mutex, while perf_event_release_kernel() can be called from munmap(). Which makes for an AB-BA deadlock. Except we can never establish the deadlock because we'll only ever call perf_event_release_kernel() after all file descriptors are dead so there is no concurrency possible. Reported-by: Stephane Eranian <eranian@google.com> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-07 11:30:59 +02:00
Ingo Molnar	cce9131781	Merge branch 'perf/urgent' into perf/core Merge reason: Resolve patch dependency Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-07 11:30:30 +02:00
Peter Zijlstra	4fd38e4595	perf: Fix exit() vs PERF_FORMAT_GROUP Both Stephane and Corey reported that PERF_FORMAT_GROUP didn't work as expected if the task the counters were attached to quit before the read() call. The cause is that we unconditionally destroy the grouping when we remove counters from their context. Fix this by only doing this when we free the counter itself. Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com> Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1273160566.5605.404.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-07 11:30:17 +02:00
Peter Zijlstra	27a9da6538	sched: Remove rq argument to the tracepoints struct rq isn't visible outside of sched.o so its near useless to expose the pointer, also there are no users of it, so remove it. Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1272997616.1642.207.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-07 11:28:17 +02:00
Ingo Molnar	48652ced15	Merge commit 'v2.6.34-rc6' into sched/core	2010-05-07 11:27:54 +02:00
Yong Zhang	4726f2a617	lockdep: Reduce stack_trace usage When calling check_prevs_add(), if all validations passed add_lock_to_list() will add new lock to dependency tree and alloc stack_trace for each list_entry. But at this time, we are always on the same stack, so stack_trace for each list_entry has the same value. This is redundant and eats up lots of memory which could lead to warning on low MAX_STACK_TRACE_ENTRIES. Use one copy of stack_trace instead. V2: As suggested by Peter Zijlstra, move save_trace() from check_prevs_add() to check_prev_add(). Add tracking for trylock dependence which is also redundant. Signed-off-by: Yong Zhang <yong.zhang0@windriver.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100504065711.GC10784@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-07 11:27:26 +02:00
Paul E. McKenney	fc390cde36	rcu: need barrier() in UP synchronize_sched_expedited() If synchronize_sched_expedited() is ever to be called from within kernel/sched.c in a !SMP PREEMPT kernel, the !SMP implementation needs a barrier(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-05-07 07:23:21 +02:00
Dan Carpenter	d9f599e1e6	perf: Fix check at end of event search The original code doesn't work because "call" is never NULL there. Signed-off-by: Dan Carpenter <error27@gmail.com> LKML-Reference: <20100320143911.GF5331@bicker> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-06 19:49:52 -04:00
Paul E. McKenney	cc631fb732	sched: correctly place paranioa memory barriers in synchronize_sched_expedited() The memory barriers must be in the SMP case, not in the !SMP case. Also add a barrier after the atomic_inc() in order to ensure that other CPUs see post-synchronize_sched_expedited() actions as following the expedited grace period. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-05-06 18:49:21 +02:00
Tejun Heo	94458d5ecb	sched: kill paranoia check in synchronize_sched_expedited() The paranoid check which verifies that the cpu_stop callback is actually called on all online cpus is completely superflous. It's guaranteed by cpu_stop facility and if it didn't work as advertised other things would go horribly wrong and trying to recover using synchronize_sched() wouldn't be very meaningful. Kill the paranoid check. Removal of this feature is done as a separate step so that it can serve as a bisection point if something actually goes wrong. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Josh Triplett <josh@freedesktop.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com>	2010-05-06 18:49:21 +02:00
Tejun Heo	969c79215a	sched: replace migration_thread with cpu_stop Currently migration_thread is serving three purposes - migration pusher, context to execute active_load_balance() and forced context switcher for expedited RCU synchronize_sched. All three roles are hardcoded into migration_thread() and determining which job is scheduled is slightly messy. This patch kills migration_thread and replaces all three uses with cpu_stop. The three different roles of migration_thread() are splitted into three separate cpu_stop callbacks - migration_cpu_stop(), active_load_balance_cpu_stop() and synchronize_sched_expedited_cpu_stop() - and each use case now simply asks cpu_stop to execute the callback as necessary. synchronize_sched_expedited() was implemented with private preallocated resources and custom multi-cpu queueing and waiting logic, both of which are provided by cpu_stop. synchronize_sched_expedited_count is made atomic and all other shared resources along with the mutex are dropped. synchronize_sched_expedited() also implemented a check to detect cases where not all the callback got executed on their assigned cpus and fall back to synchronize_sched(). If called with cpu hotplug blocked, cpu_stop already guarantees that and the condition cannot happen; otherwise, stop_machine() would break. However, this patch preserves the paranoid check using a cpumask to record on which cpus the stopper ran so that it can serve as a bisection point if something actually goes wrong theree. Because the internal execution state is no longer visible, rcu_expedited_torture_stats() is removed. This patch also renames cpu_stop threads to from "stopper/%d" to "migration/%d". The names of these threads ultimately don't matter and there's no reason to make unnecessary userland visible changes. With this patch applied, stop_machine() and sched now share the same resources. stop_machine() is faster without wasting any resources and sched migration users are much cleaner. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Josh Triplett <josh@freedesktop.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com>	2010-05-06 18:49:21 +02:00
Tejun Heo	3fc1f1e27a	stop_machine: reimplement using cpu_stop Reimplement stop_machine using cpu_stop. As cpu stoppers are guaranteed to be available for all online cpus, stop_machine_create/destroy() are no longer necessary and removed. With resource management and synchronization handled by cpu_stop, the new implementation is much simpler. Asking the cpu_stop to execute the stop_cpu() state machine on all online cpus with cpu hotplug disabled is enough. stop_machine itself doesn't need to manage any global resources anymore, so all per-instance information is rolled into struct stop_machine_data and the mutex and all static data variables are removed. The previous implementation created and destroyed RT workqueues as necessary which made stop_machine() calls highly expensive on very large machines. According to Dimitri Sivanich, preventing the dynamic creation/destruction makes booting faster more than twice on very large machines. cpu_stop resources are preallocated for all online cpus and should have the same effect. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com>	2010-05-06 18:49:20 +02:00
Tejun Heo	1142d81029	cpu_stop: implement stop_cpu[s]() Implement a simplistic per-cpu maximum priority cpu monopolization mechanism. A non-sleeping callback can be scheduled to run on one or multiple cpus with maximum priority monopolozing those cpus. This is primarily to replace and unify RT workqueue usage in stop_machine and scheduler migration_thread which currently is serving multiple purposes. Four functions are provided - stop_one_cpu(), stop_one_cpu_nowait(), stop_cpus() and try_stop_cpus(). This is to allow clean sharing of resources among stop_cpu and all the migration thread users. One stopper thread per cpu is created which is currently named "stopper/CPU". This will eventually replace the migration thread and take on its name. * This facility was originally named cpuhog and lived in separate files but Peter Zijlstra nacked the name and thus got renamed to cpu_stop and moved into stop_machine.c. * Better reporting of preemption leak as per Peter's suggestion. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com>	2010-05-06 18:49:20 +02:00
Paul E. McKenney	ee84b8243b	rcu: create rcu_my_thread_group_empty() wrapper Some RCU-lockdep splat repairs need to know whether they are running in a single-threaded process. Unfortunately, the thread_group_empty() primitive is defined in sched.h, and can induce #include hell. This commit therefore introduces a rcu_my_thread_group_empty() wrapper that is defined in rcupdate.c, thus avoiding the need to include sched.h everywhere. Signed-off-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>	2010-05-06 09:28:41 -07:00
James Morris	043b4d40f5	Merge branch 'master' into next Conflicts: security/keys/keyring.c Resolved conflict with whitespace fix in find_keyring_by_name() Signed-off-by: James Morris <jmorris@namei.org>	2010-05-06 22:21:04 +10:00
James Morris	0ffbe2699c	Merge branch 'master' into next	2010-05-06 10:56:07 +10:00
Thiago Farina	668eb65f09	tracing: Fix "integer as NULL pointer" warning. kernel/trace/trace_output.c:256:24: warning: Using plain integer as NULL pointer Signed-off-by: Thiago Farina <tfransosi@gmail.com> LKML-Reference: <1264349038-1766-3-git-send-email-tfransosi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-05 12:01:26 -04:00
Linus Torvalds	8777c793d6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: flush_delayed_work: keep the original workqueue for re-queueing	2010-05-05 07:56:36 -07:00
Linus Torvalds	f5fa05d972	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix resource leak in failure path of perf_event_open()	2010-05-04 15:16:15 -07:00
Linus Torvalds	f2809d61d6	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: Fix RCU lockdep splat on freezer_fork path rcu: Fix RCU lockdep splat in set_task_cpu on fork path mutex: Don't spin when the owner CPU is offline or other weird cases	2010-05-04 15:15:43 -07:00
Li Zefan	b629317e66	sched: Fix an RCU warning in print_task() With CONFIG_PROVE_RCU=y, a warning can be triggered: $ cat /proc/sched_debug ... kernel/cgroup.c:1649 invoked rcu_dereference_check() without protection! ... Both cgroup_path() and task_group() should be called with either rcu_read_lock or cgroup_mutex held. The rcu_dereference_check() does include cgroup_lock_is_held(), so we know that this lock is not held. Therefore, in a CONFIG_PREEMPT kernel, to say nothing of a CONFIG_PREEMPT_RT kernel, the original code could have ended up copying a string out of the freelist. This patch inserts RCU read-side primitives needed to prevent this scenario. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-04 09:25:01 -07:00
Li Zefan	fae9c79170	cgroup: Fix an RCU warning in alloc_css_id() With CONFIG_PROVE_RCU=y, a warning can be triggered: # mount -t cgroup -o memory xxx /mnt # mkdir /mnt/0 ... kernel/cgroup.c:4442 invoked rcu_dereference_check() without protection! ... This is a false-positive. It's safe to directly access parent_css->id. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-04 09:25:00 -07:00
Li Zefan	9a9686b634	cgroup: Fix an RCU warning in cgroup_path() with CONFIG_PROVE_RCU=y, a warning can be triggered: # mount -t cgroup -o debug xxx /mnt # cat /proc/$$/cgroup ... kernel/cgroup.c:1649 invoked rcu_dereference_check() without protection! ... This is a false-positive, because cgroup_path() can be called with either rcu_read_lock() held or cgroup_mutex held. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-04 09:24:59 -07:00
Borislav Petkov	956097912c	ring-buffer: Wrap open-coded WARN_ONCE Wrap open-coded WARN_ONCE functionality into the equivalent macro. Signed-off-by: Borislav Petkov <bp@alien8.de> LKML-Reference: <20100502060354.GA5281@liondog.tnic> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-04 12:23:47 -04:00
Frederic Weisbecker	777d0411cd	hw_breakpoints: Fix percpu build failure Fix this build error: kernel/hw_breakpoint.c:58:1: error: pasting "__pcpu_scope_" and "" does not give a valid preprocessing token It happens if CONFIG_DEBUG_FORCE_WEAK_PER_CPU, because we concatenate someting with the name and we have the "" in the name. Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jason Wessel <jason.wessel@windriver.com> LKML-Reference: <20100503133942.GA5497@nowhere> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-04 08:39:36 +02:00
Frederic Weisbecker	54d47a2be5	lockdep: No need to disable preemption in debug atomic ops No need to disable preemption in the debug_atomic_* ops, as we ensure interrupts are disabled already. So let's use the __this_cpu_ops() rather than this_cpu_ops() that enclose the ops in a preempt disabled section. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org>	2010-05-04 05:38:16 +02:00
Frederic Weisbecker	fa9a97dec6	lockdep: Actually _dec_ in debug_atomic_dec Fix a silly copy-paste mistake that was making debug_atomic_dec use this_cpu_inc instead of this_cpu_dec. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org>	2010-05-04 05:38:12 +02:00
Frederic Weisbecker	ba697f40db	lockdep: Provide off case for redundant_hardirqs_on increment We forgot to provide a !CONFIG_DEBUG_LOCKDEP case for the redundant_hardirqs_on stat handling. Manage that in the headers with a new __debug_atomic_inc() helper. Fixes: kernel/lockdep.c:2306: error: 'lockdep_stats' undeclared (first use in this function) kernel/lockdep.c:2306: error: (Each undeclared identifier is reported only once kernel/lockdep.c:2306: error: for each function it appears in.) Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org>	2010-05-04 05:37:28 +02:00
Peter P Waskiewicz Jr	e7a297b0d7	genirq: Add CPU mask affinity hint This patch adds a cpumask affinity hint to the irq_desc structure, along with a registration function and a read-only proc entry for each interrupt. This affinity_hint handle for each interrupt can be used by underlying drivers that need a better mechanism to control interrupt affinity. The underlying driver can register a cpumask for the interrupt, which will allow the driver to provide the CPU mask for the interrupt to anything that requests it. The intent is to extend the userspace daemon, irqbalance, to help hint to it a preferred CPU mask to balance the interrupt into. [ tglx: Fixed compile warnings, added WARN_ON, made SMP only ] Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Cc: davem@davemloft.net Cc: arjan@linux.jf.intel.com Cc: bhutchings@solarflare.com LKML-Reference: <20100430214445.3992.41647.stgit@ppwaskie-hc2.jf.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-03 11:50:57 +02:00
Steffen Klassert	6751fb3c0e	padata: Use get_online_cpus/put_online_cpus This patch puts get_online_cpus/put_online_cpus around the places we modify the padata cpumask to ensure that no cpu goes offline during this operation. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-03 11:32:12 +08:00
Steffen Klassert	7b389b2cc5	padata: Initialize the padata queues only for the used cpus padata_alloc_pd set up queues for all possible cpus. This patch changes this to set up the queues just for the used cpus. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-03 11:32:11 +08:00
Steffen Klassert	7d0d2d385c	padata: Remove superfluous might_sleep might_sleep() was placed before mutex_lock() in some places. We remove them because mutex_lock() does might_sleep() too. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-03 11:32:11 +08:00
Steffen Klassert	e2cb2f1c2c	padata: cpu hotplug code should depend on CONFIG_HOTPLUG_CPU This patch makes the padata cpu hotplug code dependend on CONFIG_HOTPLUG_CPU. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-03 11:32:11 +08:00
Herbert Xu	df2071bd08	Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2010-05-03 11:28:58 +08:00
Steffen Klassert	97e3d94aac	padata: Dont scale the parallel objects with the cpus Scaling the maximum number of objects in the parallel codepath can lead to out of memory problems on bigsmp machines. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-03 11:16:13 +08:00
Tejun Heo	048c852051	perf: Fix resource leak in failure path of perf_event_open() perf_event_open() kfrees event after init failure which doesn't release all resources allocated by perf_event_alloc(). Use free_event() instead. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@au1.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: <stable@kernel.org> LKML-Reference: <4BDBE237.1040809@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-01 13:11:25 +02:00
Frederic Weisbecker	feef47d0cb	hw-breakpoints: Get the number of available registers on boot dynamically The breakpoint generic layer assumes that archs always know in advance the static number of address registers available to host breakpoints through the HBP_NUM macro. However this is not true for every archs. For example Arm needs to get this information dynamically to handle the compatiblity between different versions. To solve this, this patch proposes to drop the static HBP_NUM macro and let the arch provide the number of available slots through a new hw_breakpoint_slots() function. For archs that have CONFIG_HAVE_MIXED_BREAKPOINTS_REGS selected, it will be called once as the number of registers fits for instruction and data breakpoints together. For the others it will be called first to get the number of instruction breakpoint registers and another time to get the data breakpoint registers, the targeted type is given as a parameter of hw_breakpoint_slots(). Reported-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Ingo Molnar <mingo@elte.hu>	2010-05-01 04:32:14 +02:00
Frederic Weisbecker	f93a205411	hw-breakpoints: Handle breakpoint weight in allocation constraints Depending on their nature and on what an arch supports, breakpoints may consume more than one address register. For example a simple absolute address match usually only requires one address register. But an address range match may consume two registers. Currently our slot allocation constraints, that tend to reflect the limited arch's resources, always consider that a breakpoint consumes one slot. Then provide a way for archs to tell us the weight of a breakpoint through a new hw_breakpoint_weight() helper. This weight will be computed against the generic allocation constraints instead of a constant value. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu>	2010-05-01 04:32:12 +02:00
Frederic Weisbecker	0102752e4c	hw-breakpoints: Separate constraint space for data and instruction breakpoints There are two outstanding fashions for archs to implement hardware breakpoints. The first is to separate breakpoint address pattern definition space between data and instruction breakpoints. We then have typically distinct instruction address breakpoint registers and data address breakpoint registers, delivered with separate control registers for data and instruction breakpoints as well. This is the case of PowerPc and ARM for example. The second consists in having merged breakpoint address space definition between data and instruction breakpoint. Address registers can host either instruction or data address and the access mode for the breakpoint is defined in a control register. This is the case of x86 and Super H. This patch adds a new CONFIG_HAVE_MIXED_BREAKPOINTS_REGS config that archs can select if they belong to the second case. Those will have their slot allocation merged for instructions and data breakpoints. The others will have a separate slot tracking between data and instruction breakpoints. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu>	2010-05-01 04:32:11 +02:00
Frederic Weisbecker	b2812d031d	hw-breakpoints: Change/Enforce some breakpoints policies The current policies of breakpoints in x86 and SH are the following: - task bound breakpoints can only break on userspace addresses - cpu wide breakpoints can only break on kernel addresses The former rule prevents ptrace breakpoints to be set to trigger on kernel addresses, which is good. But as a side effect, we can't breakpoint on kernel addresses for task bound breakpoints. The latter rule simply makes no sense, there is no reason why we can't set breakpoints on userspace while performing cpu bound profiles. We want the following new policies: - task bound breakpoint can set userspace address breakpoints, with no particular privilege required. - task bound breakpoints can set kernelspace address breakpoints but must be privileged to do that. - cpu bound breakpoints can do what they want as they are privileged already. To implement these new policies, this patch checks if we are dealing with a kernel address breakpoint, if so and if the exclude_kernel parameter is set, we tell the user that the breakpoint is invalid, which makes a good generic ptrace protection. If we don't have exclude_kernel, ensure the user has the right privileges as kernel breakpoints are quite sensitive (risk of trap recursion attacks and global performance impacts). [ Paul Mundt: keep addr space check for sh signal delivery and fix double function declaration] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2010-05-01 04:32:10 +02:00
Frederic Weisbecker	87e9b20246	hw-breakpoints: Check disabled breakpoints again We stopped checking disabled breakpoints because we weren't allowing breakpoints on NULL addresses. And gdb tends to set NULL addresses on inactive breakpoints. But refusing NULL addresses was actually a regression that has been fixed now. There is no reason anymore to not validate inactive breakpoint settings. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Ingo Molnar <mingo@elte.hu>	2010-05-01 04:32:09 +02:00
Kei Tokunaga	bf81623542	[SCSI] add scsi trace core functions and put trace points Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com> Signed-off-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2010-04-30 12:51:10 -05:00
Kei Tokunaga	5a2e399595	[SCSI] ftrace: add __print_hex() __print_hex() prints values in an array in hex (w/o '0x') (space separated) EX) 92 33 32 f3 ee 4d Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com> Signed-off-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2010-04-30 12:50:22 -05:00
Frederic Weisbecker	913769f24e	lockdep: Simplify debug atomic ops Simplify debug_atomic_inc/dec by using this_cpu_inc/dec() instead of doing it through an indirect get_cpu_var() and a manual incrementation. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org>	2010-04-30 19:15:56 +02:00
Frederic Weisbecker	8795d7717c	lockdep: Fix redundant_hardirqs_on incremented with irqs enabled When a path restore the flags while irqs are already enabled, we update the per cpu var redundant_hardirqs_on in a racy fashion and debug_atomic_inc() warns about this situation. In this particular case, loosing a few hits in a stat is not a big deal, so increment it without protection. v2: Don't bother with disabling irq, we can miss one count in rare situations Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Miller <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-04-30 19:15:49 +02:00
Frederic Weisbecker	868c522b1b	Merge commit 'v2.6.34-rc6' into core/locking Merge reason: Further lockdep patches depend on per cpu updates made in -rc1.	2010-04-30 19:12:47 +02:00
Paul E. McKenney	8b46f88084	rcu: Fix RCU lockdep splat on freezer_fork path Add an RCU read-side critical section to suppress this false positive. Located-by: Eric Paris <eparis@parisplace.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: eric.dumazet@gmail.com LKML-Reference: <1271880131-3951-2-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-30 12:03:17 +02:00
Peter Zijlstra	8b08ca52f5	rcu: Fix RCU lockdep splat in set_task_cpu on fork path Add an RCU read-side critical section to suppress this false positive. Located-by: Eric Paris <eparis@parisplace.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: eric.dumazet@gmail.com LKML-Reference: <1271880131-3951-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-30 12:03:17 +02:00
Ingo Molnar	3ca50496c2	Merge commit 'v2.6.34-rc6' into perf/core Merge reason: update to the latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-30 09:56:44 +02:00
Oleg Nesterov	4d707b9f48	workqueue: change cancel_work_sync() to clear work->data In short: change cancel_work_sync(work) to mark this work as "never queued" upon return. When cancel_work_sync(work) succeeds, we know that this work can't be queued or running, and since we own WORK_STRUCT_PENDING nobody can change the bits in work->data under us. This means we can also clear the "cwq" part along with _PENDING bit lockless before return, unless the work is queued nobody can assume get_wq_data() is stable even under cwq->lock. This change can speedup the subsequent cancel/flush requests, and as Dmitry pointed out this simplifies the usage of work_struct's which can be queued on different workqueues. Consider this pseudo code from the input subsystem: struct workqueue_struct WQ; struct work_struct WORK; for (;;) { WQ = create_workqueue(); ... if (condition()) queue_work(WQ, WORK); ... cancel_work_sync(WORK); destroy_workqueue(WQ); } If condition() returns T and then F, cancel_work_sync() will crash the kernel because WORK->data still points to the already destroyed workqueue. With this patch the code like above becomes correct. Suggested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-04-30 08:57:25 +02:00
Alan Stern	eef6a7d5c2	workqueue: warn about flush_scheduled_work() This patch (as1319) adds kerneldoc and a pointed warning to flush_scheduled_work(). Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-04-30 08:57:25 +02:00
Oleg Nesterov	47dd5be2d6	workqueue: flush_delayed_work: keep the original workqueue for re-queueing flush_delayed_work() always uses keventd_wq for re-queueing, but it should use the workqueue this dwork was queued on. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-04-30 07:24:51 +02:00
Jens Axboe	7407cf355f	Merge branch 'master' into for-2.6.35 Conflicts: fs/block_dev.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-04-29 09:36:24 +02:00
Steven Rostedt	37e44bc50d	tracing: Fix sleep time function profiling When sleep_time is off the function profiler ignores the time that a task is scheduled out. When the task is scheduled out a timestamp is taken. When the task is scheduled back in, the timestamp is compared to the current time and the saved calltimes are adjusted accordingly. But when stopping the function profiler, the sched switch hook that does this adjustment was stopped before shutting down the tracer. This allowed some tasks to not get their timestamps set when they scheduled out. When the function profiler started again, this would skew the times of the scheduler functions. This patch moves the stopping of the sched switch to after the function profiler is stopped. It also ignores zero set calltimes, which may happen on start up. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-27 21:04:24 -04:00
Chase Douglas	e330b3bcd8	tracing: Show sample std dev in function profiling When combined with function graph tracing the ftrace function profiler also prints the average run time of functions. While this gives us some good information, it doesn't tell us anything about the variance of the run times of the function. This change prints out the s^2 sample standard deviation alongside the average. This change adds one entry to the profile record structure. This increases the memory footprint of the function profiler by 1/3 on a 32-bit system, and by 1/5 on a 64-bit system when function graphing is enabled, though the memory is only allocated when the profiler is turned on. During the profiling, one extra line of code adds the squared calltime to the new record entry, so this should not adversly affect performance. Note that the square of the sample standard deviation is printed because there is no sqrt implementation for unsigned long long in the kernel. Signed-off-by: Chase Douglas <chase.douglas@canonical.com> LKML-Reference: <1272304925-2436-1-git-send-email-chase.douglas@canonical.com> [ fixed comment about ns^2 -> us^2 conversion ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-27 18:23:15 -04:00
Steven Rostedt	a838b2e634	ring-buffer: Make benchmark handle missed events With the addition of the "missed events" flags that is stored in the commit field of the ring buffer page, the ring_buffer_benchmark was not updated to handle this. If events are missed, then the missed events flag is set in the ring buffer page, the benchmark will count that flag as part of the size of the page and will hit the BUG() when it tries to read beyond the page. The solution is simply to have the ring buffer benchmark mask off the extra bits. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-27 13:26:58 -04:00
David Miller	72c9ddfd4c	ring-buffer: Make non-consuming read less expensive with lots of cpus. When performing a non-consuming read, a synchronize_sched() is performed once for every cpu which is actively tracing. This is very expensive, and can make it take several seconds to open up the 'trace' file with lots of cpus. Only one synchronize_sched() call is actually necessary. What is desired is for all cpus to see the disabling state change. So we transform the existing sequence: for_each_cpu() { ring_buffer_read_start(); } where each ring_buffer_start() call performs a synchronize_sched(), into the following: for_each_cpu() { ring_buffer_read_prepare(); } ring_buffer_read_prepare_sync(); for_each_cpu() { ring_buffer_read_start(); } wherein only the single ring_buffer_read_prepare_sync() call needs to do the synchronize_sched(). The first phase, via ring_buffer_read_prepare(), allocates the 'iter' memory and increments ->record_disabled. In the second phase, ring_buffer_read_prepare_sync() makes sure this ->record_disabled state is visible fully to all cpus. And in the final third phase, the ring_buffer_read_start() calls reset the 'iter' objects allocated in the first phase since we now know that none of the cpus are adding trace entries any more. This makes openning the 'trace' file nearly instantaneous on a sparc64 Niagara2 box with 128 cpus tracing. Signed-off-by: David S. Miller <davem@davemloft.net> LKML-Reference: <20100420.154711.11246950.davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-27 13:06:35 -04:00
Jiri Olsa	62b915f106	tracing: Add graph output support for irqsoff tracer Add function graph output to irqsoff tracer. The graph output is enabled by setting new 'display-graph' trace option. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1270227683-14631-4-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-27 12:36:53 -04:00
Alessio Igor Bogani	b8bc1389b7	ptrace: Cleanup useless header BKL isn't present anymore into this file thus we can safely remove smp_lock.h inclusion. Signed-off-by: Alessio Igor Bogani <abogani@texware.it> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: James Morris <jmorris@namei.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-04-26 23:42:51 +02:00
Jiri Olsa	d7a8d9e907	tracing: Have graph flags passed in to ouput functions Let the function graph tracer have custom flags passed to its output functions. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1270227683-14631-3-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-26 17:30:18 -04:00
Jiri Olsa	9106b69382	tracing: Add ftrace events for graph tracer Add ftrace events for graph tracer, so the graph output could be shared with other tracers. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1270227683-14631-2-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-26 16:55:08 -04:00
Andreas Schwab	46da276648	kernel/sys.c: fix compat uname machine On ppc64 you get this error: $ setarch ppc -R true setarch: ppc: Unrecognized architecture because uname still reports ppc64 as the machine. So mask off the personality flags when checking for PER_LINUX32. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-04-24 11:31:24 -07:00
Robert Richter	b971f06187	Merge commit 'tip/tracing/core' into oprofile/core Conflicts: drivers/oprofile/cpu_buffer.c Signed-off-by: Robert Richter <robert.richter@amd.com>	2010-04-23 16:47:51 +02:00
Ingo Molnar	77a7f2e94e	Merge branch 'tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into tracing/core	2010-04-23 11:25:31 +02:00
Ingo Molnar	70bce3ba77	Merge branch 'linus' into perf/core Merge reason: merge the latest fixes, update to latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-23 11:10:30 +02:00
Suresh Siddha	99bd5e2f24	sched: Fix select_idle_sibling() logic in select_task_rq_fair() Issues in the current select_idle_sibling() logic in select_task_rq_fair() in the context of a task wake-up: a) Once we select the idle sibling, we use that domain (spanning the cpu that the task is currently woken-up and the idle sibling that we found) in our wake_affine() decisions. This domain is completely different from the domain(we are supposed to use) that spans the cpu that the task currently woken-up and the cpu where the task previously ran. b) We do select_idle_sibling() check only for the cpu that the task is currently woken-up on. If select_task_rq_fair() selects the previously run cpu for waking the task, doing a select_idle_sibling() check for that cpu also helps and we don't do this currently. c) In the scenarios where the cpu that the task is woken-up is busy but with its HT siblings are idle, we are selecting the task be woken-up on the idle HT sibling instead of a core that it previously ran and currently completely idle. i.e., we are not taking decisions based on wake_affine() but directly selecting an idle sibling that can cause an imbalance at the SMT/MC level which will be later corrected by the periodic load balancer. Fix this by first going through the load imbalance calculations using wake_affine() and once we make a decision of woken-up cpu vs previously-ran cpu, then choose a possible idle sibling for waking up the task on. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1270079265.7835.8.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-23 11:02:02 +02:00
Peter Zijlstra	669c55e9f9	sched: Pre-compute cpumask_weight(sched_domain_span(sd)) Dave reported that his large SPARC machines spend lots of time in hweight64(), try and optimize some of those needless cpumask_weight() invocations (esp. with the large offstack cpumasks these are very expensive indeed). Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-23 11:02:02 +02:00
Peter Zijlstra	74f5187ac8	sched: Cure load average vs NO_HZ woes Chase reported that due to us decrementing calc_load_task prematurely (before the next LOAD_FREQ sample), the load average could be scewed by as much as the number of CPUs in the machine. This patch, based on Chase's patch, cures the problem by keeping the delta of the CPU going into NO_HZ idle separately and folding that in on the next LOAD_FREQ update. This restores the balance and we get strict LOAD_FREQ period samples. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Chase Douglas <chase.douglas@canonical.com> LKML-Reference: <1271934490.1776.343.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-23 11:02:02 +02:00
Benjamin Herrenschmidt	4b40221048	mutex: Don't spin when the owner CPU is offline or other weird cases Due to recent load-balancer changes that delay the task migration to the next wakeup, the adaptive mutex spinning ends up in a live lock when the owner's CPU gets offlined because the cpu_online() check lives before the owner running check. This patch changes mutex_spin_on_owner() to return 0 (don't spin) in any case where we aren't sure about the owner struct validity or CPU number, and if the said CPU is offline. There is no point going back & re-evaluate spinning in corner cases like that, let's just go to sleep. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1271212509.13059.135.camel@pasglop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-23 11:00:28 +02:00
Jiri Kosina	6c9468e9eb	Merge branch 'master' into for-next	2010-04-23 02:08:44 +02:00
David Howells	e134d200d5	CRED: Fix a race in creds_are_invalid() in credentials debugging creds_are_invalid() reads both cred->usage and cred->subscribers and then compares them to make sure the number of processes subscribed to a cred struct never exceeds the refcount of that cred struct. The problem is that this can cause a race with both copy_creds() and exit_creds() as the two counters, whilst they are of atomic_t type, are only atomic with respect to themselves, and not atomic with respect to each other. This means that if creds_are_invalid() can read the values on one CPU whilst they're being modified on another CPU, and so can observe an evolving state in which the subscribers count now is greater than the usage count a moment before. Switching the order in which the counts are read cannot help, so the thing to do is to remove that particular check. I had considered rechecking the values to see if they're in flux if the test fails, but I can't guarantee they won't appear the same, even if they've changed several times in the meantime. Note that this can only happen if CONFIG_DEBUG_CREDENTIALS is enabled. The problem is only likely to occur with multithreaded programs, and can be tested by the tst-eintr1 program from glibc's "make check". The symptoms look like: CRED: Invalid credentials CRED: At include/linux/cred.h:240 CRED: Specified credentials: ffff88003dda5878 [real][eff] CRED: ->magic=43736564, put_addr=(null) CRED: ->usage=766, subscr=766 CRED: ->uid = { 0,0,0,0 } CRED: ->gid = { 0,0,0,0 } CRED: ->security is ffff88003d72f538 CRED: ->security {359, 359} ------------[ cut here ]------------ kernel BUG at kernel/cred.c:850! ... RIP: 0010:[<ffffffff81049889>] [<ffffffff81049889>] __invalid_creds+0x4e/0x52 ... Call Trace: [<ffffffff8104a37b>] copy_creds+0x6b/0x23f Note the ->usage=766 and subscr=766. The values appear the same because they've been re-read since the check was made. Reported-by: Roland McGrath <roland@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-22 09:14:29 +10:00
Frederic Weisbecker	cecbca96da	tracing: Dump either the oops's cpu source or all cpus buffers The ftrace_dump_on_oops kernel parameter, sysctl and sysrq let one dump every cpu buffers when an oops or panic happens. It's nice when you have few cpus but it may take ages if have many, plus you miss the real origin of the problem in all the cpu traces. Sometimes, all you need is to dump the cpu buffer that triggered the opps, most of the time it is our main interest. This patch modifies ftrace_dump_on_oops to handle this choice. The ftrace_dump_on_oops kernel parameter, when it comes alone, has the same behaviour than before. But ftrace_dump_on_oops=orig_cpu will only dump the buffer of the cpu that oops'ed. Similarly, sysctl kernel.ftrace_dump_on_oops=1 and echo 1 > /proc/sys/kernel/ftrace_dump_on_oops keep their previous behaviour. But setting 2 jumps into cpu origin dump mode. v2: Fix double setup v3: Fix spelling issues reported by Randy Dunlap v4: Also update __ftrace_dump in the selftests Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>	2010-04-21 23:11:42 +02:00
Ingo Molnar	ac0053fd51	Merge commit 'v2.6.34-rc5' into tracing/core Merge reason: pick up latest -rc's. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-21 09:47:05 +02:00
David Howells	eff30363c0	CRED: Fix double free in prepare_usermodehelper_creds() error handling Patch `570b8fb505`: Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Date: Tue Mar 30 00:04:00 2010 +0100 Subject: CRED: Fix memory leak in error handling attempts to fix a memory leak in the error handling by making the offending return statement into a jump down to the bottom of the function where a kfree(tgcred) is inserted. This is, however, incorrect, as it does a kfree() after doing put_cred() if security_prepare_creds() fails. That will result in a double free if 'error' is jumped to as put_cred() will also attempt to free the new tgcred record by virtue of it being pointed to by the new cred record. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-21 09:20:35 +10:00
Zhang, Yanmin	39447b386c	perf: Enhance perf to allow for guest statistic collection from host Below patch introduces perf_guest_info_callbacks and related register/unregister functions. Add more PERF_RECORD_MISC_XXX bits meaning guest kernel and guest user space. Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2010-04-19 12:35:33 +03:00
Paul E. McKenney	bc293d62b2	rcu: Make RCU lockdep check the lockdep_recursion variable The lockdep facility temporarily disables lockdep checking by incrementing the current->lockdep_recursion variable. Such disabling happens in NMIs and in other situations where lockdep might expect to recurse on itself. This patch therefore checks current->lockdep_recursion, disabling RCU lockdep splats when this variable is non-zero. In addition, this patch removes the "likely()", as suggested by Lai Jiangshan. Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Reported-by: David Miller <davem@davemloft.net> Tested-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: eric.dumazet@gmail.com LKML-Reference: <20100415195039.GA22623@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-19 08:37:19 +02:00
Mike Galbraith	09a40af524	sched: Fix UP update_avg() build warning update_avg() is only used for SMP builds, move it to the nearest SMP block. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1271309399.14779.17.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-15 09:36:47 +02:00
Ingo Molnar	b257c14ceb	Merge branch 'linus' into sched/core Merge reason: merge the latest fixes, update to -rc4. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-15 09:36:16 +02:00
Ingo Molnar	b5a80b7e91	Merge branch 'perf' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 into perf/core	2010-04-15 09:16:51 +02:00
Divyesh Shah	b6ac23af2c	blkio: fix for modular blk-cgroup build After merging the block tree, 20100414's linux-next build (x86_64 allmodconfig) failed like this: ERROR: "get_gendisk" [block/blk-cgroup.ko] undefined! ERROR: "sched_clock" [block/blk-cgroup.ko] undefined! This happens because the two symbols aren't exported and hence not available when blk-cgroup code is built as a module. I've tried to stay consistent with the use of EXPORT_SYMBOL or EXPORT_SYMBOL_GPL with the other symbols in the respective files. Signed-off-by: Divyesh Shah <dpshah@google.com> Acked-by: Gui Jianfeng <guijianfeng@cn.fujitsu.cn> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-04-15 08:54:59 +02:00
Frederic Weisbecker	95476b64ab	perf: Fix hlist related build error hlist helpers need to be available for all software events, not only trace events. Pull them out outside the ifdef CONFIG_EVENT_TRACING section. Fixes: kernel/perf_event.c:4573: error: implicit declaration of function 'swevent_hlist_put' kernel/perf_event.c:4614: error: implicit declaration of function 'swevent_hlist_get' kernel/perf_event.c:5534: error: implicit declaration of function 'swevent_hlist_release Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1271281338-23491-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-15 01:34:46 +02:00
Masami Hiramatsu	93ccae7a22	tracing/kprobes: Support basic types on dynamic events Support basic types of integer (u8, u16, u32, u64, s8, s16, s32, s64) in kprobe tracer. With this patch, users can specify above basic types on each arguments after ':'. If omitted, the argument type is set as unsigned long (u32 or u64, arch-dependent). e.g. echo 'p account_system_time+0 hardirq_offset=%si:s32' > kprobe_events adds a probe recording hardirq_offset in signed-32bits value on the entry of account_system_time. Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100412171708.3790.18599.stgit@localhost6.localdomain6> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2010-04-14 17:26:28 -03:00
Frederic Weisbecker	df8290bf7e	perf: Make clock software events consistent with general exclusion rules The cpu/task clock events implement their own version of exclusion on top of exclude_user and exclude_kernel. The result is that when the event triggered in the kernel but we have exclude_kernel set, we try to rewind using task_pt_regs. There are two side effects of this: - we call task_pt_regs even on kernel threads, which doesn't give us the desired result. - if the event occured in the kernel, we shouldn't rewind to the user context. We want to actually ignore the event. get_irq_regs() will always give us the right interrupted context, so use its result and submit it to perf_exclude_context() that knows when an event must be ignored. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu>	2010-04-14 18:20:33 +02:00
Frederic Weisbecker	76e1d9047e	perf: Store active software events in a hashlist Each time a software event triggers, we need to walk through the entire list of events from the current cpu and task contexts to retrieve a running perf event that matches. We also need to check a matching perf event is actually counting. This walk is wasteful and makes the event fast path scaling down with a growing number of events running on the same contexts. To solve this, we store the running perf events in a hashlist to get an immediate access to them against their type:event_id when they trigger. v2: - Fix SWEVENT_HLIST_SIZE definition (and re-learn some basic maths along the way) - Only allocate hlist for online cpus, but keep track of the refcount on offline possible cpus too, so that we allocate it if needed when it becomes online. - Drop the kref use as it's not adapted to our tricks anymore. v3: - Fix bad refcount check (address instead of value). Thanks to Eric Dumazet who spotted this. - While exiting cpu, move the hlist release out of the IPI path to lock the hlist mutex sanely. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu>	2010-04-14 18:20:33 +02:00
Ingo Molnar	b15c7b1cee	Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/core	2010-04-14 12:15:23 +02:00
Dmitry Torokhov	97f5f0cd8c	Input: implement SysRq as a separate input handler Instead of keeping SysRq support inside of legacy keyboard driver split it out into a separate input handler (filter). This stops most SysRq input events from leaking into evdev clients (some events, such as first SysRq scancode - not keycode - event, are still leaked into both legacy keyboard and evdev). [martinez.javier@gmail.com: fix compile error when CONFIG_MAGIC_SYSRQ is not defined] Signed-off-by: Dmitry Torokhov <dtor@mail.ru>	2010-04-13 23:26:02 -07:00
Thomas Gleixner	6932bf37be	genirq: Remove IRQF_DISABLED from core code Remove all code which is related to IRQF_DISABLED from the core kernel code. IRQF_DISABLED still exists as a flag, but becomes a NOOP and will be removed after a grace period. That way we can easily revert to the previous behaviour by just restoring the core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Linus Torvalds <torvalds@osdl.org> LKML-Reference: <20100326000405.991244690@linutronix.de>	2010-04-13 16:36:40 +02:00
Ingo Molnar	e58aa3d2d0	genirq: Run irq handlers with interrupts disabled Running interrupt handlers with interrupts enabled can cause stack overflows. That has been observed with multiqueue NICs delivering all their interrupts to a single core. We might band aid that somehow by checking the interrupt stacks, but the real safe fix is to run the irq handlers with interrupts disabled. Drivers for whacky hardware still can reenable them in the handler itself, if the need arises. (They do already due to lockdep) The risk of doing this is rather low: - lockdep already enforces this - CONFIG_NOHZ has shaken out the drivers which relied on jiffies updates - time keeping is not longer sensitive to the timer interrupt being delayed Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Linus Torvalds <torvalds@osdl.org> LKML-Reference: <20100326000405.758579387@linutronix.de>	2010-04-13 16:36:40 +02:00
Marc Zyngier	ae731f8d07	genirq: Introduce request_any_context_irq() Now that we enjoy threaded interrupts, we're starting to see irq_chip implementations (wm831x, pca953x) that make use of threaded interrupts for the controller, and nested interrupts for the client interrupt. It all works very well, with one drawback: Drivers requesting an IRQ must now know whether the handler will run in a thread context or not, and call request_threaded_irq() or request_irq() accordingly. The problem is that the requesting driver sometimes doesn't know about the nature of the interrupt, specially when the interrupt controller is a discrete chip (typically a GPIO expander connected over I2C) that can be connected to a wide variety of otherwise perfectly supported hardware. This patch introduces the request_any_context_irq() function that mostly mimics the usual request_irq(), except that it checks whether the irq level is configured as nested or not, and calls the right backend. On success, it also returns either IRQC_IS_HARDIRQ or IRQC_IS_NESTED. [ tglx: Made return value an enum, simplified code and made the export of request_any_context_irq GPL ] Signed-off-by: Marc Zyngier <maz@misterjones.org> Cc: <joachim.eastwood@jotron.com> LKML-Reference: <927ea285bd0c68934ddae1a47e44a9ba@localhost> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-04-13 16:36:39 +02:00
Thomas Gleixner	7c7145f6ac	Merge branch 'linus' into irq/core Reason: Get the upstream IRQF_DISABLED related changes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-04-13 14:12:17 +02:00
John Stultz	6a867a3955	time: Remove xtime_cache With the earlier logarithmic time accumulation patch, xtime will now always be within one "tick" of the current time, instead of possibly half a second off. This removes the need for the xtime_cache value, which always stored the time at the last interrupt, so this patch cleans that up removing the xtime_cache related code. This patch also addresses an issue with an earlier version of this change, where xtime_cache was normalizing xtime, which could in some cases be not valid (ie: tv_nsec == NSEC_PER_SEC). This is fixed by handling the edge case in update_wall_time(). Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Petr Titěra <P.Titera@century.cz> LKML-Reference: <1270589451-30773-1-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-04-13 12:43:42 +02:00
Eric Paris	05b90496f2	security: remove dead hook acct Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-12 12:19:19 +10:00
Eric Paris	6307f8fee2	security: remove dead hook task_setgroups Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-12 12:19:18 +10:00
Eric Paris	06ad187e28	security: remove dead hook task_setgid Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-12 12:19:17 +10:00
Eric Paris	43ed8c3b45	security: remove dead hook task_setuid Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-12 12:19:16 +10:00
Eric Paris	0968d0060a	security: remove dead hook cred_commit Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-12 12:19:15 +10:00
Jiri Slaby	d88d4050dc	PM / Hibernate: user.c, fix SNAPSHOT_SET_SWAP_AREA handling When CONFIG_DEBUG_BLOCK_EXT_DEVT is set we decode the device improperly by old_decode_dev and it results in an error while hibernating with s2disk. All users already pass the new device number, so switch to new_decode_dev(). Signed-off-by: Jiri Slaby <jslaby@suse.cz> Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2010-04-10 22:28:56 +02:00
Arnd Bergmann	5534ecb2dd	ptrace: kill BKL in ptrace syscall The comment suggests that this usage is stale. There is no bkl in the exec path so if there is a race lurking there, the bkl in ptrace is not going to help in this regard. Overview of the possibility of "accidental" races this bkl might protect: - ptrace_traceme() is protected against task removal and concurrent read/write on current->ptrace as it locks write tasklist_lock. - arch_ptrace_attach() is serialized by ptrace_traceme() against concurrent PTRACE_TRACEME or PTRACE_ATTACH - ptrace_attach() is protected the same way ptrace_traceme() and in turn serializes arch_ptrace_attach() - ptrace_check_attach() does its own well described serializing too. There is no obvious race here. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Roland McGrath <roland@redhat.com>	2010-04-10 15:34:21 +02:00
Linus Torvalds	2aedd192f7	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix sched_getaffinity()	2010-04-08 08:37:05 -07:00
Ingo Molnar	ca7e0c6120	Merge branch 'linus' into perf/core Semantic conflict: arch/x86/kernel/cpu/perf_event_intel_ds.c Merge reason: pick up latest fixes, fix the conflict Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-08 13:37:18 +02:00
Ingo Molnar	c1ab9cab75	Merge branch 'linus' into tracing/core Conflicts: include/linux/module.h kernel/module.c Semantic conflict: include/trace/events/module.h Merge reason: Resolve the conflict with upstream commit `5fbfb18` ("Fix up possibly racy module refcounting") Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-08 10:18:47 +02:00
KAMEZAWA Hiroyuki	a3a2e76c77	mm: avoid null-pointer deref in sync_mm_rss() - We weren't zeroing p->rss_stat[] at fork() - Consequently sync_mm_rss() was dereferencing tsk->mm for kernel threads and was oopsing. - Make __sync_task_rss_stat() static, too. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648 [akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)] Reported-by: Troels Liebe Bentsen <tlb@rapanden.dk> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> "Michael S. Tsirkin" <mst@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-04-07 08:38:02 -07:00
Linus Torvalds	94c4fcec01	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Force MSI irq handlers to run with interrupts disabled	2010-04-06 13:03:22 -07:00
Carsten Emde	351b3f7a21	hrtimers: Provide schedule_hrtimeout for CLOCK_REALTIME The current version of schedule_hrtimeout() always uses the monotonic clock. Some system calls such as mq_timedsend() and mq_timedreceive(), however, require the use of the wall clock due to the definition of the system call. This patch provides the infrastructure to use schedule_hrtimeout() with a CLOCK_REALTIME timer. Signed-off-by: Carsten Emde <C.Emde@osadl.org> Tested-by: Pradyumna Sampath <pradysam@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Veen <arjan@infradead.org> LKML-Reference: <20100402204331.167439615@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-04-06 21:50:03 +02:00
Arjan van de Ven	3bbb9ec946	timers: Introduce the concept of timer slack for legacy timers While HR timers have had the concept of timer slack for quite some time now, the legacy timers lacked this concept, and had to make do with round_jiffies() and friends. Timer slack is important for power management; grouping timers reduces the number of wakeups which in turn reduces power consumption. This patch introduces timer slack to the legacy timers using the following pieces: * A slack field in the timer struct * An api (set_timer_slack) that callers can use to set explicit timer slack * A default slack of 0.4% of the requested delay for callers that do not set any explicit slack * Rounding code that is part of mod_timer() that tries to group timers around jiffies values every 'power of two' (so quick timers will group around every 2, but longer timers will group around every 4, 8, 16, 32 etc) Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: johnstul@us.ibm.com Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-04-06 21:50:02 +02:00
Anton Blanchard	84fba5ec91	sched: Fix sched_getaffinity() taskset on 2.6.34-rc3 fails on one of my ppc64 test boxes with the following error: sched_getaffinity(0, 16, 0x10029650030) = -1 EINVAL (Invalid argument) This box has 128 threads and 16 bytes is enough to cover it. Commit `cd3d8031eb` (sched: sched_getaffinity(): Allow less than NR_CPUS length) is comparing this 16 bytes agains nr_cpu_ids. Fix it by comparing nr_cpu_ids to the number of bits in the cpumask we pass in. Signed-off-by: Anton Blanchard <anton@samba.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Sharyathi Nagesh <sharyath@in.ibm.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Russ Anderson <rja@sgi.com> Cc: Mike Travis <travis@sgi.com> LKML-Reference: <20100406070218.GM5594@kryten> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-06 10:01:35 +02:00
Nick Piggin	5fbfb18d7a	Fix up possibly racy module refcounting Module refcounting is implemented with a per-cpu counter for speed. However there is a race when tallying the counter where a reference may be taken by one CPU and released by another. Reference count summation may then see the decrement without having seen the previous increment, leading to lower than expected count. A module which never has its actual reference drop below 1 may return a reference count of 0 due to this race. Module removal generally runs under stop_machine, which prevents this race causing bugs due to removal of in-use modules. However there are other real bugs in module.c code and driver code (module_refcount is exported) where the callers do not run under stop_machine. Fix this by maintaining running per-cpu counters for the number of module refcount increments and the number of refcount decrements. The increments are tallied after the decrements, so any decrement seen will always have its corresponding increment counted. The final refcount is the difference of the total increments and decrements, preventing a low-refcount from being returned. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-04-05 19:50:02 -07:00
Frederic Weisbecker	bd6d29c25b	lockstat: Make lockstat counting per cpu Locking statistics are implemented using global atomic variables. This is usually fine unless some path write them very often. This is the case for the function and function graph tracers that disable irqs for each entry saved (except if the function tracer is in preempt disabled only mode). And calls to local_irq_save/restore() increment hardirqs_on_events and hardirqs_off_events stats (or similar stats for redundant versions). Incrementing these global vars for each function ends up in too much cache bouncing if lockstats are enabled. To solve this, implement the debug_atomic_*() operations using per cpu vars. -v2: Use per_cpu() instead of get_cpu_var() to fetch the desired cpu vars on debug_atomic_read() -v3: Store the stats in a structure. No need for local_t as we are NMI/irq safe. -v4: Fix tons of build errors. I thought I had tested it but I probably forgot to select the relevant config. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1270505417-8144-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org>	2010-04-06 00:15:37 +02:00
Eric Paris	449cedf099	audit: preface audit printk with audit There have been a number of reports of people seeing the message: "name_count maxed, losing inode data: dev=00:05, inode=3185" in dmesg. These usually lead to people reporting problems to the filesystem group who are in turn clueless what they mean. Eventually someone finds me and I explain what is going on and that these come from the audit system. The basics of the problem is that the audit subsystem never expects a single syscall to 'interact' (for some wish washy meaning of interact) with more than 20 inodes. But in fact some operations like loading kernel modules can cause changes to lots of inodes in debugfs. There are a couple real fixes being bandied about including removing the fixed compile time limit of 20 or not auditing changes in debugfs (or both) but neither are small and obvious so I am not sending them for immediate inclusion (I hope Al forwards a real solution next devel window). In the meantime this patch simply adds 'audit' to the beginning of the crap message so if a user sees it, they come blame me first and we can talk about what it means and make sure we understand all of the reasons it can happen and make sure this gets solved correctly in the long run. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-04-05 13:19:45 -07:00
Linus Torvalds	b66696e3c0	Merge branch 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc * 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc: eeepc-wmi: include slab.h staging/otus: include slab.h from usbdrv.h percpu: don't implicitly include slab.h from percpu.h kmemcheck: Fix build errors due to missing slab.h include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h iwlwifi: don't include iwl-dev.h from iwl-devtrace.h x86: don't include slab.h from arch/x86/include/asm/pgtable_32.h Fix up trivial conflicts in include/linux/percpu.h due to is_kernel_percpu_address() having been introduced since the slab.h cleanup with the percpu_up.c splitup.	2010-04-05 09:39:11 -07:00
Linus Torvalds	9e74e7c81a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: module: add stub for is_module_percpu_address percpu, module: implement and use is_kernel/module_percpu_address() module: encapsulate percpu handling better and record percpu_size	2010-04-05 09:16:37 -07:00
Lai Jiangshan	aa27497c2f	tracing: Fix uninitialized variable of tracing/trace output Because a local variable is not initialized, I got these when I did 'cat tracing/trace'. (not trace_pipe): CPU:0 [LOST 18446744071579453134 EVENTS] ps-3099 [000] 560.770221: lock_acquire: ffff880030865010 &(&dentry->d_lock)->rlock CPU:0 [LOST 18446744071579453134 EVENTS] ps-3099 [000] 560.770221: lock_release: ffff880030865010 &(&dentry->d_lock)->rlock CPU:0 [LOST 18446612133255294080 EVENTS] ps-3099 [000] 560.770221: lock_acquire: ffff880030865010 &(&dentry->d_lock)->rlock CPU:0 [LOST 18446744071579453134 EVENTS] ps-3099 [000] 560.770222: lock_release: ffff880030865010 &(&dentry->d_lock)->rlock CPU:0 [LOST 18446744071579453134 EVENTS] ps-3099 [000] 560.770222: lock_release: ffffffff816cfb98 dcache_lock See peek_next_entry(), it does not set *lost_events when we 'cat tracing/trace' Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4BB9A929.2000303@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-05 11:01:22 -04:00
Tejun Heo	336f5899d2	Merge branch 'master' into export-slabh	2010-04-05 11:37:28 +09:00
Linus Torvalds	8ce42c8b7f	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Always build the powerpc perf_arch_fetch_caller_regs version perf: Always build the stub perf_arch_fetch_caller_regs version perf, probe-finder: Build fix on Debian perf/scripts: Tuple was set from long in both branches in python_process_event() perf: Fix 'perf sched record' deadlock perf, x86: Fix callgraphs of 32-bit processes on 64-bit kernels perf, x86: Fix AMD hotplug & constraint initialization x86: Move notify_cpu_starting() callback to a later stage x86,kgdb: Always initialize the hw breakpoint attribute perf: Use hot regs with software sched switch/migrate events perf: Correctly align perf event tracing buffer	2010-04-04 12:13:10 -07:00
Linus Torvalds	0121b0c771	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: set_cpus_allowed_ptr(): Don't use rq->migration_thread after unlock sched: Fix proc_sched_set_task()	2010-04-04 12:12:31 -07:00
Linus Torvalds	a8941b0ed0	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ring-buffer: Add missing unlock tracing: Fix lockdep warning in global_clock()	2010-04-04 12:12:19 -07:00
Arnd Bergmann	3326c1ceee	perf_event: Make perf fd non seekable Perf_event does not need seeking, so prevent it in order to get rid of default_llseek, which uses the BKL. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> [drop the nonseekable_open, not needed for anon inodes] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-04-04 15:27:58 +02:00
Ingo Molnar	22a4e4c435	Merge branch 'perf/urgent' into perf/core Conflicts: tools/perf/Makefile Merge reason: resolve the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-03 18:17:55 +02:00
Frederic Weisbecker	26d80aa782	perf: Always build the stub perf_arch_fetch_caller_regs version Now that software events use perf_arch_fetch_caller_regs() too, we need the stub version to be always built in for archs that don't implement it. Fixes the following build error in PARISC: kernel/built-in.o: In function `perf_event_task_sched_out': (.text.perf_event_task_sched_out+0x54): undefined reference to `perf_arch_fetch_caller_regs' Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org>	2010-04-03 12:22:05 +02:00
Linus Torvalds	5e123e5d9b	Merge branch 'kgdb-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'kgdb-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdb: Turn off tracing while in the debugger kgdb: use atomic_inc and atomic_dec instead of atomic_set kgdb: eliminate kgdb_wait(), all cpus enter the same way kgdbts,sh: Add in breakpoint pc offset for superh kgdb: have ebin2mem call probe_kernel_write once	2010-04-02 19:45:05 -07:00
Linus Torvalds	24b99d1576	Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: Freezer: Fix buggy resume test for tasks frozen with cgroup freezer Freezer: Only show the state of tasks refusing to freeze	2010-04-02 19:44:42 -07:00
Jason Wessel	4da75b9cea	kgdb: Turn off tracing while in the debugger The kernel debugger should turn off kernel tracing any time the debugger is active and restore it on resume. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-02 14:58:19 -05:00
Jason Wessel	ae6bf53e02	kgdb: use atomic_inc and atomic_dec instead of atomic_set Memory barriers should be used for the kgdb cpu synchronization. The atomic_set() does not imply a memory barrier. Reported-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-04-02 14:58:18 -05:00
Jason Wessel	62fae31219	kgdb: eliminate kgdb_wait(), all cpus enter the same way This is a kgdb architectural change to have all the cpus (master or slave) enter the same function. A cpu that hits an exception (wants to be the master cpu) will call kgdb_handle_exception() from the trap handler and then invoke a kgdb_roundup_cpu() to synchronize the other cpus and bring them into the kgdb_handle_exception() as well. A slave cpu will enter kgdb_handle_exception() from the kgdb_nmicallback() and set the exception state to note that the processor is a slave. Previously the salve cpu would have called kgdb_wait(). This change allows the debug core to change cpus without resuming the system in order to inspect arch specific cpu information. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-04-02 14:58:18 -05:00
Jason Wessel	a0279bd580	kgdb: have ebin2mem call probe_kernel_write once Rather than call probe_kernel_write() one byte at a time, process the whole buffer locally and pass the entire result in one go. This way, architectures that need to do special handling based on the length can do so, or we only end up calling memcpy() once. [sonic.zhang@analog.com: Reported original problem and preliminary patch] Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>	2010-04-02 14:58:17 -05:00
Peter Zijlstra	371fd7e7a5	sched: Add enqueue/dequeue flags In order to reduce the dependency on TASK_WAKING rework the enqueue interface to support a proper flags field. Replace the int wakeup, bool head arguments with an int flags argument and create the following flags: ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task, ENQUEUE_WAKING - the enqueue has relative vruntime due to having sched_class::task_waking() called, ENQUEUE_HEAD - the waking task should be places on the head of the priority queue (where appropriate). For symmetry also convert sched_class::dequeue() to a flags scheme. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:05 +02:00
Peter Zijlstra	cc87f76a60	sched: Fix nr_uninterruptible count The cpuload calculation in calc_load_account_active() assumes rq->nr_uninterruptible will not change on an offline cpu after migrate_nr_uninterruptible(). However the recent migrate on wakeup changes broke that and would result in decrementing the offline cpu's rq->nr_uninterruptible. Fix this by accounting the nr_uninterruptible on the waking cpu. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:04 +02:00
Peter Zijlstra	65cc8e4859	sched: Optimize task_rq_lock() Now that we hold the rq->lock over set_task_cpu() again, we can do away with most of the TASK_WAKING checks and reduce them again to set_cpus_allowed_ptr(). Removes some conditionals from scheduling hot-paths. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Oleg Nesterov <oleg@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:04 +02:00
Peter Zijlstra	0017d73509	sched: Fix TASK_WAKING vs fork deadlock Oleg noticed a few races with the TASK_WAKING usage on fork. - since TASK_WAKING is basically a spinlock, it should be IRQ safe - since we set TASK_WAKING () without holding rq->lock it could be there still is a rq->lock holder, thereby not actually providing full serialization. () in fact we clear PF_STARTING, which in effect enables TASK_WAKING. Cure the second issue by not setting TASK_WAKING in sched_fork(), but only temporarily in wake_up_new_task() while calling select_task_rq(). Cure the first by holding rq->lock around the select_task_rq() call, this will disable IRQs, this however requires that we push down the rq->lock release into select_task_rq_fair()'s cgroup stuff. Because select_task_rq_fair() still needs to drop the rq->lock we cannot fully get rid of TASK_WAKING. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:03 +02:00
Oleg Nesterov	9084bb8246	sched: Make select_fallback_rq() cpuset friendly Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems with select_fallback_rq(). It can be called from any context and can't use any cpuset locks including task_lock(). It is called when the task doesn't have online cpus in ->cpus_allowed but ttwu/etc must be able to find a suitable cpu. I am not proud of this patch. Everything which needs such a fat comment can't be good even if correct. But I'd prefer to not change the locking rules in the code I hardly understand, and in any case I believe this simple change make the code much more correct compared to deadlocks we currently have. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091027.GA9155@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:03 +02:00
Oleg Nesterov	6a1bdc1b57	sched: _cpu_down(): Don't play with current->cpus_allowed _cpu_down() changes the current task's affinity and then recovers it at the end. The problems are well known: we can't restore old_allowed if it was bound to the now-dead-cpu, and we can race with the userspace which can change cpu-affinity during unplug. _cpu_down() should not play with current->cpus_allowed at all. Instead, take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable() removes the dying cpu from cpu_online_mask. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091023.GA9148@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:03 +02:00
Oleg Nesterov	30da688ef6	sched: sched_exec(): Remove the select_fallback_rq() logic sched_exec()->select_task_rq() reads/updates ->cpus_allowed lockless. This can race with other CPUs updating our ->cpus_allowed, and this looks meaningless to me. The task is current and running, it must have online cpus in ->cpus_allowed, the fallback mode is bogus. And, if ->sched_class returns the "wrong" cpu, this likely means we raced with set_cpus_allowed() which was called for reason, why should sched_exec() retry and call ->select_task_rq() again? Change the code to call sched_class->select_task_rq() directly and do nothing if the returned cpu is wrong after re-checking under rq->lock. From now task_struct->cpus_allowed is always stable under TASK_WAKING, select_fallback_rq() is always called under rq-lock or the caller or the caller owns TASK_WAKING (select_task_rq). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091019.GA9141@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:02 +02:00
Oleg Nesterov	c1804d547d	sched: move_task_off_dead_cpu(): Remove retry logic The previous patch preserved the retry logic, but it looks unneeded. __migrate_task() can only fail if we raced with migration after we dropped the lock, but in this case the caller of set_cpus_allowed/etc must initiate migration itself if ->on_rq == T. We already fixed p->cpus_allowed, the changes in active/online masks must be visible to racer, it should migrate the task to online cpu correctly. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091014.GA9138@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:02 +02:00
Oleg Nesterov	1445c08d06	sched: move_task_off_dead_cpu(): Take rq->lock around select_fallback_rq() move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed lockless. We can race with set_cpus_allowed() running in parallel. Change it to take rq->lock around select_fallback_rq(). Note that it is not trivial to move this spin_lock() into select_fallback_rq(), we must recheck the task was not migrated after we take the lock and other callers do not need this lock. To avoid the races with other callers of select_fallback_rq() which rely on TASK_WAKING, we also check p->state != TASK_WAKING and do nothing otherwise. The owner of TASK_WAKING must update ->cpus_allowed and choose the correct CPU anyway, and the subsequent __migrate_task() is just meaningless because p->se.on_rq must be false. Alternatively, we could change select_task_rq() to take rq->lock right after it calls sched_class->select_task_rq(), but this looks a bit ugly. Also, change it to not assume irqs are disabled and absorb __migrate_task_irq(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091010.GA9131@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:01 +02:00
Oleg Nesterov	897f0b3c3f	sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code This patch just states the fact the cpusets/cpuhotplug interaction is broken and removes the deadlockable code which only pretends to work. - cpuset_lock() doesn't really work. It is needed for cpuset_cpus_allowed_locked() but we can't take this lock in try_to_wake_up()->select_fallback_rq() path. - cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take cpuset_lock() and hangs forever because CPU is already dead and thus T can't be scheduled. - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock() which is not irq-safe, but try_to_wake_up() can be called from irq. Kill them, and change select_fallback_rq() to use cpu_possible_mask, like we currently do without CONFIG_CPUSETS. Also, with or without this patch, with or without CONFIG_CPUSETS, the callers of select_fallback_rq() can race with each other or with set_cpus_allowed() pathes. The subsequent patches try to to fix these problems. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091003.GA9123@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:01 +02:00
Li Zefan	32bd7eb5a7	sched: Remove remaining USER_SCHED code This is left over from commit `7c9414385e` ("sched: Remove USER_SCHED"") Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Howells <dhowells@redhat.com> LKML-Reference: <4BA9A05F.7010407@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:00 +02:00
Oleg Nesterov	47a70985e5	sched: set_cpus_allowed_ptr(): Don't use rq->migration_thread after unlock Trivial typo fix. rq->migration_thread can be NULL after task_rq_unlock(), this is why we have "mt" which should be used instead. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100330165829.GA18284@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:11:05 +02:00
Mike Galbraith	269484a492	sched: Fix proc_sched_set_task() Latencytop clearing sum_exec_runtime via proc_sched_set_task() breaks task_times(). Other places in kernel use nvcsw and nivcsw, which are being cleared as well, Clear task statistics only. Reported-by: Török Edwin <edwintorok@gmail.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1269940193.19286.14.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:06:40 +02:00
Ingo Molnar	c9494727cf	Merge branch 'linus' into sched/core Merge reason: update to latest upstream Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:03:08 +02:00
Ingo Molnar	ec5e61aabe	Merge branch 'perf/urgent' into perf/core Conflicts: arch/x86/kernel/cpu/perf_event.c Merge reason: Resolve the conflict, pick up fixes Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 19:38:10 +02:00
Mike Galbraith	8bb39f9aa0	perf: Fix 'perf sched record' deadlock perf sched record can deadlock a box should the holder of handle->data->lock take an interrupt, and then attempt to acquire an rq lock held by a CPU trying to acquire the same lock. Disable interrupts. CPU0 CPU1 sched event with rq->lock held grab handle->data->lock spin on handle->data->lock interrupt try to grab rq->lock Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1269598293.6174.8.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 19:30:05 +02:00
Ingo Molnar	50d11d190a	Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/urgent	2010-04-02 19:29:17 +02:00
Frederic Weisbecker	e49a5bd381	perf: Use hot regs with software sched switch/migrate events Scheduler's task migration events don't work because they always pass NULL regs perf_sw_event(). The event hence gets filtered in perf_swevent_add(). Scheduler's context switches events use task_pt_regs() to get the context when the event occured which is a wrong thing to do as this won't give us the place in the kernel where we went to sleep but the place where we left userspace. The result is even more wrong if we switch from a kernel thread. Use the hot regs snapshot for both events as they belong to the non-interrupt/exception based events family. Unlike page faults or so that provide the regs matching the exact origin of the event, we need to save the current context. This makes the task migration event working and fix the context switch callchains and origin ip. Example: perf record -a -e cs Before: 10.91% ksoftirqd/0 0 [k] 0000000000000000 \| --- (nil) perf_callchain perf_prepare_sample __perf_event_overflow perf_swevent_overflow perf_swevent_add perf_swevent_ctx_event do_perf_sw_event __perf_sw_event perf_event_task_sched_out schedule run_ksoftirqd kthread kernel_thread_helper After: 23.77% hald-addon-stor [kernel.kallsyms] [k] schedule \| --- schedule \| \|--60.00%-- schedule_timeout \| wait_for_common \| wait_for_completion \| blk_execute_rq \| scsi_execute \| scsi_execute_req \| sr_test_unit_ready \| \| \| \|--66.67%-- sr_media_change \| \| media_changed \| \| cdrom_media_changed \| \| sr_block_media_changed \| \| check_disk_change \| \| cdrom_open v2: Always build perf_arch_fetch_caller_regs() now that software events need that too. They don't need it from modules, unlike trace events, so we keep the EXPORT_SYMBOL in trace_event_perf.c Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net>	2010-04-01 08:26:31 +02:00
Frederic Weisbecker	eb1e79611c	perf: Correctly align perf event tracing buffer The trace event buffer used by perf to record raw sample events is typed as an array of char and may then not be aligned to 8 by alloc_percpu(). But we need it to be aligned to 8 in sparc64 because we cast this buffer into a random structure type built by the TRACE_EVENT() macro to store the traces. So if a random 64 bits field is accessed inside, it may be not under an expected good alignment. Use an array of long instead to force the appropriate alignment, and perform a compile time check to ensure the size in byte of the buffer is a multiple of sizeof(long) so that its actual size doesn't get shrinked under us. This fixes unaligned accesses reported while using perf lock in sparc 64. Suggested-by: David Miller <davem@davemloft.net> Suggested-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Steven Rostedt <rostedt@goodmis.org>	2010-04-01 08:26:30 +02:00
Steven Rostedt	ff0ff84a07	ring-buffer: Add lost event count to end of sub buffer Currently, binary readers of the ring buffer only know where events were lost, but not how many events were lost at that location. This information is available, but it would require adding another field to the sub buffer header to include it. But when a event can not fit at the end of a sub buffer, it is written to the next sub buffer. This means there is a good chance that the buffer may have room to hold this counter. If it does, write the counter at the end of the sub buffer and set another flag in the data size field that states that this information exists. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-31 22:57:08 -04:00
Steven Rostedt	bc21b47842	tracing: Show the lost events in the trace_pipe output Now that the ring buffer can keep track of where events are lost. Use this information to the output of trace_pipe: hackbench-3588 [001] 1326.701660: lock_acquire: ffffffff816591e0 read rcu_read_lock hackbench-3588 [001] 1326.701661: lock_acquire: ffff88003f4091f0 &(&dentry->d_lock)->rlock hackbench-3588 [001] 1326.701664: lock_release: ffff88003f4091f0 &(&dentry->d_lock)->rlock CPU:1 [LOST 673 EVENTS] hackbench-3588 [001] 1326.702711: kmem_cache_free: call_site=ffffffff81102b85 ptr=ffff880026d96738 hackbench-3588 [001] 1326.702712: lock_release: ffff88003e1480a8 &mm->mmap_sem hackbench-3588 [001] 1326.702713: lock_acquire: ffff88003e1480a8 &mm->mmap_sem Even works with the function graph tracer: 2) ! 170.098 us \| } 2) 4.036 us \| rcu_irq_exit(); 2) 3.657 us \| idle_cpu(); 2) ! 190.301 us \| } CPU:2 [LOST 2196 EVENTS] 2) 0.853 us \| } /* cancel_dirty_page */ 2) \| remove_from_page_cache() { 2) 1.578 us \| _raw_spin_lock_irq(); 2) \| __remove_from_page_cache() { Note, it does not work with the iterator "trace" file, since it requires the use of consuming the page from the ring buffer to determine how many events were lost, which the iterator does not do. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-31 22:57:06 -04:00
Steven Rostedt	66a8cb95ed	ring-buffer: Add place holder recording of dropped events Currently, when the ring buffer drops events, it does not record the fact that it did so. It does inform the writer that the event was dropped by returning a NULL event, but it does not put in any place holder where the event was dropped. This is not a trivial thing to add because the ring buffer mostly runs in overwrite (flight recorder) mode. That is, when the ring buffer is full, new data will overwrite old data. In a produce/consumer mode, where new data is simply dropped when the ring buffer is full, it is trivial to add the placeholder for dropped events. When there's more room to write new data, then a special event can be added to notify the reader about the dropped events. But in overwrite mode, any new write can overwrite events. A place holder can not be inserted into the ring buffer since there never may be room. A reader could also come in at anytime and miss the placeholder. Luckily, the way the ring buffer works, the read side can find out if events were lost or not, and how many events. Everytime a write takes place, if it overwrites the header page (the next read) it updates a "overrun" variable that keeps track of the number of lost events. When a reader swaps out a page from the ring buffer, it can record this number, perfom the swap, and then check to see if the number changed, and take the diff if it has, which would be the number of events dropped. This can be stored by the reader and returned to callers of the reader. Since the reader page swap will fail if the writer moved the head page since the time the reader page set up the swap, this gives room to record the overruns without worrying about races. If the reader sets up the pages, records the overrun, than performs the swap, if the swap succeeds, then the overrun variable has not been updated since the setup before the swap. For binary readers of the ring buffer, a flag is set in the header of each sub page (sub buffer) of the ring buffer. This flag is embedded in the size field of the data on the sub buffer, in the 31st bit (the size can be 32 or 64 bits depending on the architecture), but only 27 bits needs to be used for the actual size (less actually). We could add a new field in the sub buffer header to also record the number of events dropped since the last read, but this will change the format of the binary ring buffer a bit too much. Perhaps this change can be made if the information on the number of events dropped is considered important enough. Note, the notification of dropped events is only used by consuming reads or peeking at the ring buffer. Iterating over the ring buffer does not keep this information because the necessary data is only available when a page swap is made, and the iterator does not swap out pages. Cc: Robert Richter <robert.richter@amd.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "Luis Claudio R. Goncalves" <lclaudio@uudg.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-31 22:57:04 -04:00
Steven Rostedt	eb0c53771f	tracing: Fix compile error in module tracepoints when MODULE_UNLOAD not set If modules are configured in the build but unloading of modules is not, then the refcnt is not defined. Place the get/put module tracepoints under CONFIG_MODULE_UNLOAD since it references this field in the module structure. As a side-effect, this patch also reduces the code when MODULE_UNLOAD is not set, because these unused tracepoints are not created. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-31 22:56:59 -04:00
Li Zefan	ae832d1e03	tracing: Remove side effect from module tracepoints that caused a GPF Remove the @refcnt argument, because it has side-effects, and arguments with side-effects are not skipped by the jump over disabled instrumentation and are executed even when the tracepoint is disabled. This was also causing a GPF as found by Randy Dunlap: Subject: 2.6.33 GP fault only when built with tracing LKML-Reference: <4BA2B69D.3000309@oracle.com> Note, the current 2.6.34-rc has a fix for the actual cause of the GPF, but this fixes one of its triggers. Tested-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BA97FA7.6040406@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-31 22:56:58 -04:00
Thomas Gleixner	753649dbc4	genirq: Force MSI irq handlers to run with interrupts disabled Network folks reported that directing all MSI-X vectors of their multi queue NICs to a single core can cause interrupt stack overflows when enough interrupts fire at the same time. This is caused by the fact that we run interrupt handlers by default with interrupts enabled unless the driver reuqests the interrupt with the IRQF_DISABLED set. The NIC handlers do not set this flag, so simultaneous interrupts can nest unlimited and cause the stack overflow. The only safe counter measure is to run the interrupt handlers with interrupts disabled. We can't switch to this mode in general right now, but it is safe to do so for MSI interrupts. Force IRQF_DISABLED for MSI interrupt handlers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: Linus Torvalds <torvalds@osdl.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: stable@kernel.org	2010-03-31 15:48:38 +02:00
Linus Torvalds	246750ffa1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: CRED: Fix memory leak in error handling	2010-03-30 07:26:30 -07:00
Linus Torvalds	be3fd3cc7c	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Do not free zero sized per cpu areas x86: Make sure free_init_pages() frees pages on page boundary x86: Make smp_locks end with page alignment	2010-03-30 07:22:38 -07:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
Mathieu Desnoyers	570b8fb505	CRED: Fix memory leak in error handling Fix a memory leak on an OOM condition in prepare_usermodehelper_creds(). Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-03-30 17:15:38 +11:00
Julia Lawall	292f60c0c4	ring-buffer: Add missing unlock In some error handling cases the lock is not unlocked. The return is converted to a goto, to share the unlock at the end of the function. A simplified version of the semantic patch that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r exists@ expression E1; identifier f; @@ f (...) { <+... * spin_lock_irq (E1,...); ... when != E1 * return ...; ...+> } // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> LKML-Reference: <Pine.LNX.4.64.1003291736440.21896@ask.diku.dk> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-29 15:23:24 -04:00
Li Zefan	e36673ec51	tracing: Fix lockdep warning in global_clock() # echo 1 > events/enable # echo global > trace_clock ------------[ cut here ]------------ WARNING: at kernel/lockdep.c:3162 check_flags+0xb2/0x190() ... ---[ end trace 3f86734a89416623 ]--- possible reason: unannotated irqs-on. ... There's no reason to use the raw_local_irq_save() in trace_clock_global. The local_irq_save() version is fine, and does not cause the bug in lockdep. Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BA97FA1.7030606@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-29 15:16:44 -04:00
Ian Campbell	eed63519e3	x86: Do not free zero sized per cpu areas This avoids an infinite loop in free_early_partial(). Add a warning to free_early_partial() to catch future problems. -v5: put back start > end back into WARN_ONCE() -v6: use one line for warning, suggested by Linus -v7: more tests -v8: remove the function name as suggested by Johannes WARN_ONCE() will print out that function name. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Joel Becker <joel.becker@oracle.com> Tested-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Miller <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <1269830604-26214-4-git-send-email-yinghai@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-29 18:55:40 +02:00
David Howells	a53f4f9efa	SLOW_WORK: CONFIG_SLOW_WORK_PROC should be CONFIG_SLOW_WORK_DEBUG CONFIG_SLOW_WORK_PROC was changed to CONFIG_SLOW_WORK_DEBUG, but not in all instances. Change the remaining instances. This makes the debugfs file display the time mark and the owner's description again. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-29 09:14:47 -07:00
Dave Airlie	88be12c440	slow-work: use get_ref wrapper instead of directly calling get_ref Otherwise we can get an oops if the user has no get_ref/put_ref requirement. Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-29 09:13:30 -07:00
Tejun Heo	10fad5e46f	percpu, module: implement and use is_kernel/module_percpu_address() lockdep has custom code to check whether a pointer belongs to static percpu area which is somewhat broken. Implement proper is_kernel/module_percpu_address() and replace the custom code. On UP, percpu variables are regular static variables and can't be distinguished from them. Always return %false on UP. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@redhat.com>	2010-03-29 23:07:12 +09:00
Tejun Heo	259354deaa	module: encapsulate percpu handling better and record percpu_size Better encapsulate module static percpu area handling so that code outsidef of CONFIG_SMP ifdef doesn't deal with mod->percpu directly and add mod->percpu_size and record percpu_size in it. Both percpu fields are compiled out on UP. While at it, mark mod->percpu w/ __percpu. This is to prepare for is_module_percpu_address(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au>	2010-03-29 23:07:12 +09:00
Henrik Kretzschmar	975d260355	padata: Section cleanup This patch removes the __cupinit from padata_cpu_callback(), which is refered by the exportet function padata_alloc(). This could lead to problems if CONFIG_HOTPLUG_CPU is disabled, which should happen very often. WARNING: kernel/built-in.o(.text+0x7ffcb): Section mismatch in reference from the function padata_alloc() to the function .cpuinit.text:padata_cpu_callback() The function padata_alloc() references the function __cpuinit padata_cpu_callback(). This is often because padata_alloc lacks a __cpuinit annotation or the annotation of padata_cpu_callback is wrong. Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-03-29 16:15:31 +08:00
Linus Torvalds	b72c40949b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: x86/PCI: truncate _CRS windows with _LEN > _MAX - _MIN + 1 x86/PCI: for host bridge address space collisions, show conflicting resource frv/PCI: remove redundant warnings x86/PCI: remove redundant warnings PCI: don't say we claimed a resource if we failed PCI quirk: Disable MSI on VIA K8T890 systems PCI quirk: RS780/RS880: work around missing MSI initialization PCI quirk: only apply CX700 PCI bus parking quirk if external VT6212L is present PCI: complain about devices that seem to be broken PCI: print resources consistently with %pR PCI: make disabled window printk style match the enabled ones PCI: break out primary/secondary/subordinate for readability PCI: for address space collisions, show conflicting resource resources: add interfaces that return conflict information PCI: cleanup error return for pcix get and set mmrbc functions PCI: fix access of PCI_X_CMD by pcix get and set mmrbc functions PCI: kill off pci_register_set_vga_state() symbol export. PCI: fix return value from pcix_get_max_mmrbc()	2010-03-26 16:34:29 -07:00
Matt Helsley	5a7aadfe2f	Freezer: Fix buggy resume test for tasks frozen with cgroup freezer When the cgroup freezer is used to freeze tasks we do not want to thaw those tasks during resume. Currently we test the cgroup freezer state of the resuming tasks to see if the cgroup is FROZEN. If so then we don't thaw the task. However, the FREEZING state also indicates that the task should remain frozen. This also avoids a problem pointed out by Oren Ladaan: the freezer state transition from FREEZING to FROZEN is updated lazily when userspace reads or writes the freezer.state file in the cgroup filesystem. This means that resume will thaw tasks in cgroups which should be in the FROZEN state if there is no read/write of the freezer.state file to trigger this transition before suspend. NOTE: Another "simple" solution would be to always update the cgroup freezer state during resume. However it's a bad choice for several reasons: Updating the cgroup freezer state is somewhat expensive because it requires walking all the tasks in the cgroup and checking if they are each frozen. Worse, this could easily make resume run in N^2 time where N is the number of tasks in the cgroup. Finally, updating the freezer state from this code path requires trickier locking because of the way locks must be ordered. Instead of updating the freezer state we rely on the fact that lazy updates only manage the transition from FREEZING to FROZEN. We know that a cgroup with the FREEZING state may actually be FROZEN so test for that state too. This makes sense in the resume path even for partially-frozen cgroups -- those that really are FREEZING but not FROZEN. Reported-by: Oren Ladaan <orenl@cs.columbia.edu> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Cc: stable@kernel.org Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-03-26 23:51:44 +01:00
Xiaotian Feng	4f598458ea	Freezer: Only show the state of tasks refusing to freeze show_state will dump all tasks state, so if freezer failed to freeze any task, kernel will dump all tasks state and flood the dmesg log. This patch makes freezer only show state of tasks refusing to freeze. Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-03-26 23:51:13 +01:00
Linus Torvalds	054319b5e2	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: time: Fix accumulation bug triggered by long delay. posix-cpu-timers: Reset expire cache when no timer is running timer stats: Fix del_timer_sync() and try_to_del_timer_sync() clockevents: Sanitize min_delta_ns adjustment and prevent overflows	2010-03-26 15:10:38 -07:00
Linus Torvalds	833961d81f	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align	2010-03-26 15:10:13 -07:00
Linus Torvalds	3cacf42462	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Use proper type in sched_getaffinity() kernel/sched.c: Suppress unused var warning sched: sched_getaffinity(): Allow less than NR_CPUS length	2010-03-26 15:09:59 -07:00
Linus Torvalds	309d1dcb5b	Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Move two IRQ functions from .init.text to .text genirq: Protect access to irq_desc->action in can_request_irq() genirq: Prevent oneshot irq thread race	2010-03-26 15:09:06 -07:00
Linus Torvalds	8128f55a0b	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Remove excessive early_res debug output softlockup: Stop spurious softlockup messages due to overflow rcu: Fix local_irq_disable() CONFIG_PROVE_RCU=y false positives rcu: Fix tracepoints & lockdep false positive rcu: Make rcu_read_lock_bh_held() allow for disabled BH	2010-03-26 15:08:31 -07:00
Peter Zijlstra	faa4602e47	x86, perf, bts, mm: Delete the never used BTS-ptrace code Support for the PMU's BTS features has been upstreamed in v2.6.32, but we still have the old and disabled ptrace-BTS, as Linus noticed it not so long ago. It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without regard for other uses (perf) and doesn't provide the flexibility needed for perf either. Its users are ptrace-block-step and ptrace-bts, since ptrace-bts was never used and ptrace-block-step can be implemented using a much simpler approach. So axe all 3000 lines of it. That includes the locked_memory() APIs in mm/mlock.c as well. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Markus Metzger <markus.t.metzger@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <20100325135413.938004390@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-26 11:33:55 +01:00
Miao Xie	53feb29767	cpuset: alloc nodemask_t on the heap rather than the stack Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-24 16:31:21 -07:00
Miao Xie	5ab116c934	cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node cpuset_mem_spread_node() returns an offline node, and causes an oops. This patch fixes it by initializing task->mems_allowed to node_states[N_HIGH_MEMORY], and updating task->mems_allowed when doing memory hotplug. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Reported-by: Nick Piggin <npiggin@suse.de> Tested-by: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-24 16:31:21 -07:00
Li Zefan	9d34706f42	cgroups: remove duplicate include commit `e6a1105b` ("cgroups: subsystem module loading interface") and commit `c50cc752` ("sched, cgroups: Fix module export") result in duplicate including of module.h Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-24 16:31:19 -07:00
Henrik Kretzschmar	860652bfb8	genirq: Move two IRQ functions from .init.text to .text Both functions should not be marked as __init, since they be called from modules after the init section is freed. Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jkosina@suse.cz> LKML-Reference: <1269431961-5731-1-git-send-email-henne@nachtwindheim.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-24 14:38:23 +01:00
Thomas Gleixner	cc8c3b7843	genirq: Protect access to irq_desc->action in can_request_irq() can_request_irq() accesses and dereferences irq_desc->action w/o holding irq_desc->lock. So action can be freed on another CPU before it's dereferenced. Unlikely, but ... Protect it with desc->lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-24 14:38:23 +01:00
Dimitri Sivanich	92d6b71ab9	genirq: Expose irq_desc->node in proc/irq Expose irq_desc->node as /proc/irq/*/node. This file provides device hardware locality information for apps desiring to include hardware locality in irq mapping decisions. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-24 14:10:03 +01:00
Bjorn Helgaas	66f1207bce	resources: add interfaces that return conflict information request_resource() and insert_resource() only return success or failure, which no information about what existing resource conflicted with the proposed new reservation. This patch adds request_resource_conflict() and insert_resource_conflict(), which return the conflicting resource. Callers may use this for better error messages or to adjust the new resource and retry the request. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-03-23 13:33:50 -07:00
John Stultz	e1292ba164	ntp: Make time_adjust static Now that no arches are accessing time_adjust directly, make it static. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1268968769-19209-1-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-23 17:19:37 +01:00
John Stultz	830ec0458c	time: Fix accumulation bug triggered by long delay. The logarithmic accumulation done in the timekeeping has some overflow protection that limits the max shift value. That means it will take more then shift loops to accumulate all of the cycles. This causes the shift decrement to underflow, which causes the loop to never exit. The simplest fix would be simply to do a: if (shift) shift--; However that is not optimal, as we know the cycle offset is larger then the interval << shift, the above would make shift drop to zero, then we would be spinning for quite awhile accumulating at interval chunks at a time. Instead, this patch only decreases shift if the offset is smaller then cycle_interval << shift. This makes sure we accumulate using the largest chunks possible without overflowing tick_length, and limits the number of iterations through the loop. This issue was found and reported by Sonic Zhang, who also tested the fix. Many thanks your explanation and testing! Reported-by: Sonic Zhang <sonic.adi@gmail.com> Signed-off-by: John Stultz <johnstul@us.ibm.com> Tested-by: Sonic Zhang <sonic.adi@gmail.com> LKML-Reference: <1268948850-5225-1-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-23 16:41:01 +01:00
Ingo Molnar	d2f1e15b66	Merge commit 'v2.6.34-rc2' into perf/core Merge reason: Pick up latest perf fixes from upstream. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-22 18:47:01 +01:00
Colin Ian King	8c2eb4805d	softlockup: Stop spurious softlockup messages due to overflow Ensure additions on touch_ts do not overflow. This can occur when the top 32 bits of the TSC reach 0xffffffff causing additions to touch_ts to overflow and this in turn generates spurious softlockup warnings. Signed-off-by: Colin Ian King <colin.king@canonical.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1268994482.1798.6.camel@lenovo> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-21 19:30:13 +01:00
Steven Rostedt	2271048d1b	ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align The ring buffer uses 4 byte alignment while recording events into the buffer, even on 64bit machines. This saves space when there are lots of events being recorded at 4 byte boundaries. The ring buffer has a zero copy method to write into the buffer, with the reserving of space and then committing it. This may cause problems when writing an 8 byte word into a 4 byte alignment (not 8). For x86 and PPC this is not an issue, but on some architectures this would cause an out-of-alignment exception. This patch uses CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine if it is OK to use 4 byte alignments on 64 bit machines. If it is not, it forces the ring buffer event header to be 8 bytes and not 4, and will align the length of the data to be 8 byte aligned. This keeps the data payload at 8 byte alignments and will allow these machines to run without issue. The trick to this is that the header can be either 4 bytes or 8 bytes depending on the length of the data payload. The 4 byte header has a length field that supports up to 112 bytes. If the length of the data is more than 112, the length field is set to zero, and the actual length is stored in the next 4 bytes after the header. When CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set, the code forces zero in the 4 byte header forcing the length to be stored in the 4 byte array, even with a small data load. It also forces the length of the data load to be 8 byte aligned. The combination of these two guarantee that the data is always at 8 byte alignment. Tested-by: Frederic Weisbecker <fweisbec@gmail.com> (on sparc64) Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-18 23:11:35 -04:00
Linus Torvalds	f82c37e7bb	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits) perf: Fix unexported generic perf_arch_fetch_caller_regs perf record: Don't try to find buildids in a zero sized file perf: export perf_trace_regs and perf_arch_fetch_caller_regs perf, x86: Fix hw_perf_enable() event assignment perf, ppc: Fix compile error due to new cpu notifiers perf: Make the install relative to DESTDIR if specified kprobes: Calculate the index correctly when freeing the out-of-line execution slot perf tools: Fix sparse CPU numbering related bugs perf_event: Fix oops triggered by cpu offline/online perf: Drop the obsolete profile naming for trace events perf: Take a hot regs snapshot for trace events perf: Introduce new perf_fetch_caller_regs() for hot regs snapshot perf/x86-64: Use frame pointer to walk on irq and process stacks lockdep: Move lock events under lockdep recursion protection perf report: Print the map table just after samples for which no map was found perf report: Add multiple event support perf session: Change perf_session post processing functions to take histogram tree perf session: Add storage for seperating event types in report perf session: Change add_hist_entry to take the tree root instead of session perf record: Add ID and to recorded event data when recording multiple events ...	2010-03-18 16:52:46 -07:00
Frederic Weisbecker	dcd5c1662d	perf: Fix unexported generic perf_arch_fetch_caller_regs perf_arch_fetch_caller_regs() is exported for the overriden x86 version, but not for the generic weak version. As a general rule, weak functions should not have their symbol exported in the same file they are defined. So let's export it on trace_event_perf.c as it is used by trace events only. This fixes: ERROR: ".perf_arch_fetch_caller_regs" [fs/xfs/xfs.ko] undefined! ERROR: ".perf_arch_fetch_caller_regs" [arch/powerpc/platforms/cell/spufs/spufs.ko] undefined! -v2: And also only build it if trace events are enabled. -v3: Fix changelog mistake Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1268697902-9518-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-17 12:26:49 +01:00
KOSAKI Motohiro	8bc037fb89	sched: Use proper type in sched_getaffinity() Using the proper type fixes the following compiler warning: kernel/sched.c:4850: warning: comparison of distinct pointer types lacks a cast Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: torvalds@linux-foundation.org Cc: travis@sgi.com Cc: peterz@infradead.org Cc: drepper@redhat.com Cc: rja@sgi.com Cc: sharyath@in.ibm.com Cc: steiner@sgi.com LKML-Reference: <20100317090046.4C79.A69D9226@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-17 10:48:49 +01:00
Thomas Weber	8839316121	Fix typos in comments [Ss]ytem => [Ss]ystem udpate => update paramters => parameters orginal => original Signed-off-by: Thomas Weber <swirl@gmx.li> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-03-16 11:47:56 +01:00
Andrew Morton	c890692bf3	kernel/sched.c: Suppress unused var warning On UP: kernel/sched.c: In function 'wake_up_new_task': kernel/sched.c:2631: warning: unused variable 'cpu' Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-16 11:13:42 +01:00
Dan Carpenter	6427462bfa	sched: Remove some dead code This was left over from "7c9414385e sched: Remove USER_SCHED" Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Dhaval Giani <dhaval.giani@gmail.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> LKML-Reference: <20100315082148.GD18181@bicker> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-16 11:05:44 +01:00
Paul E. McKenney	e3818b8dce	rcu: Make rcu_read_lock_bh_held() allow for disabled BH Disabling BH can stand in for rcu_read_lock_bh(), and this patch updates rcu_read_lock_bh_held() to allow for this. In order to avoid include-file hell, this function is moved out of line to kernel/rcupdate.c. This fixes a false positive RCU warning. Reported-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20100316000343.GA25857@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-16 09:57:49 +01:00
Frederic Weisbecker	1d199b1ad6	perf: Fix unexported generic perf_arch_fetch_caller_regs perf_arch_fetch_caller_regs() is exported for the overriden x86 version, but not for the generic weak version. As a general rule, weak functions should not have their symbol exported in the same file they are defined. So let's export it on trace_event_perf.c as it is used by trace events only. This fixes: ERROR: ".perf_arch_fetch_caller_regs" [fs/xfs/xfs.ko] undefined! ERROR: ".perf_arch_fetch_caller_regs" [arch/powerpc/platforms/cell/spufs/spufs.ko] undefined! -v2: And also only build it if trace events are enabled. -v3: Fix changelog mistake Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1268697902-9518-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-16 09:27:27 +01:00
KOSAKI Motohiro	cd3d8031eb	sched: sched_getaffinity(): Allow less than NR_CPUS length [ Note, this commit changes the syscall ABI for > 1024 CPUs systems. ] Recently, some distro decided to use NR_CPUS=4096 for mysterious reasons. Unfortunately, glibc sched interface has the following definition: # define __CPU_SETSIZE 1024 # define __NCPUBITS (8 * sizeof (__cpu_mask)) typedef unsigned long int __cpu_mask; typedef struct { __cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS]; } cpu_set_t; It mean, if NR_CPUS is bigger than 1024, cpu_set_t makes an ABI issue ... More recently, Sharyathi Nagesh reported following test program makes misterious syscall failure: ----------------------------------------------------------------------- #define _GNU_SOURCE #include<stdio.h> #include<errno.h> #include<sched.h> int main() { cpu_set_t set; if (sched_getaffinity(0, sizeof(cpu_set_t), &set) < 0) printf("\n Call is failing with:%d", errno); } ----------------------------------------------------------------------- Because the kernel assumes len argument of sched_getaffinity() is bigger than NR_CPUS. But now it is not correct. Now we are faced with the following annoying dilemma, due to the limitations of the glibc interface built in years ago: (1) if we change glibc's __CPU_SETSIZE definition, we lost binary compatibility of _all_ application. (2) if we don't change it, we also lost binary compatibility of Sharyathi's use case. Then, I would propse to change the rule of the len argument of sched_getaffinity(). Old: len should be bigger than NR_CPUS New: len should be bigger than maximum possible cpu id This creates the following behavior: (A) In the real 4096 cpus machine, the above test program still return -EINVAL. (B) NR_CPUS=4096 but the machine have less than 1024 cpus (almost all machines in the world), the above can run successfully. Fortunatelly, BIG SGI machine is mainly used for HPC use case. It means they can rebuild their programs. IOW we hope they are not annoyed by this issue ... Reported-by: Sharyathi Nagesh <sharyath@in.ibm.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Russ Anderson <rja@sgi.com> Cc: Mike Travis <travis@sgi.com> LKML-Reference: <20100312161316.9520.A69D9226@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-15 08:28:44 +01:00
Ingo Molnar	12b8aeee3e	Merge branch 'linus' into timers/core Conflicts: Documentation/feature-removal-schedule.txt Merge reason: Resolve the conflict, update to upstream. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-15 08:17:33 +01:00
Linus Torvalds	80a186074e	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix pick_next_highest_task_rt() for cgroups sched: Cleanup: remove unused variable in try_to_wake_up() x86: Fix sched_clock_cpu for systems with unsynchronized TSC	2010-03-13 14:46:18 -08:00
Linus Torvalds	4e3eaddd14	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: locking: Make sparse work with inline spinlocks and rwlocks x86/mce: Fix RCU lockdep splats rcu: Increase RCU CPU stall timeouts if PROVE_RCU ftrace: Replace read_barrier_depends() with rcu_dereference_raw() rcu: Suppress RCU lockdep warnings during early boot rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare() rcu: Suppress __mpol_dup() false positive from RCU lockdep rcu: Make rcu_read_lock_sched_held() handle !PREEMPT rcu: Add control variables to lockdep_rcu_dereference() diagnostics rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU rcu: Use wrapper function instead of exporting tasklist_lock sched, rcu: Fix rcu_dereference() for RCU-lockdep rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU x86/gart: Unexport gart_iommu_aperture Fix trivial conflicts in kernel/trace/ftrace.c	2010-03-13 14:43:01 -08:00
Linus Torvalds	8655e7e3dd	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Do not record user stack trace from NMI context tracing: Disable buffer switching when starting or stopping trace tracing: Use same local variable when resetting the ring buffer function-graph: Init curr_ret_stack with ret_stack ring-buffer: Move disabled check into preempt disable section function-graph: Add tracing_thresh support to function_graph tracer tracing: Update the comm field in the right variable in update_max_tr function-graph: Use comment notation for func names of dangling '}' function-graph: Fix unused reference to ftrace_set_func() tracing: Fix warning in s_next of trace file ops tracing: Include irqflags headers from trace clock	2010-03-13 14:40:50 -08:00
Linus Torvalds	9fdfbc2bff	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Provide generic perf_sample_data initialization MAINTAINERS: Add Arnaldo as tools/perf/ co-maintainer perf trace: Don't use pager if scripting perf trace/scripting: Remove extraneous header read perf, ARM: Modify kuser rmb() call to compile for Thumb-2 x86/stacktrace: Don't dereference bad frame pointers perf archive: Don't try to collect files without a build-id perf_events, x86: Fixup fixed counter constraints perf, x86: Restrict the ANY flag perf, x86: rename macro in ARCH_PERFMON_EVENTSEL_ENABLE perf, x86: add some IBS macros to perf_event.h perf, x86: make IBS macros available in perf_event.h hw-breakpoints: Remove stub unthrottle callback x86/hw-breakpoints: Remove the name field perf: Remove pointless breakpoint union perf lock: Drop the buffers multiplexing dependency perf lock: Fix and add misc documentally things percpu: Add __percpu sparse annotations to hw_breakpoint	2010-03-13 14:39:42 -08:00
Steven Rostedt	b6345879cc	tracing: Do not record user stack trace from NMI context A bug was found with Li Zefan's ftrace_stress_test that caused applications to segfault during the test. Placing a tracing_off() in the segfault code, and examining several traces, I found that the following was always the case. The lock tracer was enabled (lockdep being required) and userstack was enabled. Testing this out, I just enabled the two, but that was not good enough. I needed to run something else that could trigger it. Running a load like hackbench did not work, but executing a new program would. The following would trigger the segfault within seconds: # echo 1 > /debug/tracing/options/userstacktrace # echo 1 > /debug/tracing/events/lock/enable # while :; do ls > /dev/null ; done Enabling the function graph tracer and looking at what was happening I finally noticed that all cashes happened just after an NMI. 1) \| copy_user_handle_tail() { 1) \| bad_area_nosemaphore() { 1) \| __bad_area_nosemaphore() { 1) \| no_context() { 1) \| fixup_exception() { 1) 0.319 us \| search_exception_tables(); 1) 0.873 us \| } [...] 1) 0.314 us \| __rcu_read_unlock(); 1) 0.325 us \| native_apic_mem_write(); 1) 0.943 us \| } 1) 0.304 us \| rcu_nmi_exit(); [...] 1) 0.479 us \| find_vma(); 1) \| bad_area() { 1) \| __bad_area() { After capturing several traces of failures, all of them happened after an NMI. Curious about this, I added a trace_printk() to the NMI handler to read the regs->ip to see where the NMI happened. In which I found out it was here: ffffffff8135b660 <page_fault>: ffffffff8135b660: 48 83 ec 78 sub $0x78,%rsp ffffffff8135b664: e8 97 01 00 00 callq ffffffff8135b800 <error_entry> What was happening is that the NMI would happen at the place that a page fault occurred. It would call rcu_read_lock() which was traced by the lock events, and the user_stack_trace would run. This would trigger a page fault inside the NMI. I do not see where the CR2 register is saved or restored in NMI handling. This means that it would corrupt the page fault handling that the NMI interrupted. The reason the while loop of ls helped trigger the bug, was that each execution of ls would cause lots of pages to be faulted in, and increase the chances of the race happening. The simple solution is to not allow user stack traces in NMI context. After this patch, I ran the above "ls" test for a couple of hours without any issues. Without this patch, the bug would trigger in less than a minute. Cc: stable@kernel.org Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-12 20:31:49 -05:00
Steven Rostedt	a2f8071428	tracing: Disable buffer switching when starting or stopping trace When the trace iterator is read, tracing_start() and tracing_stop() is called to stop tracing while the iterator is processing the trace output. These functions disable both the standard buffer and the max latency buffer. But if the wakeup tracer is running, it can switch these buffers between the two disables: buffer = global_trace.buffer; if (buffer) ring_buffer_record_disable(buffer); <<<--------- swap happens here buffer = max_tr.buffer; if (buffer) ring_buffer_record_disable(buffer); What happens is that we disabled the same buffer twice. On tracing_start() we can enable the same buffer twice. All ring_buffer_record_disable() must be matched with a ring_buffer_record_enable() or the buffer can be disable permanently, or enable prematurely, and cause a bug where a reset happens while a trace is commiting. This patch protects these two by taking the ftrace_max_lock to prevent a switch from occurring. Found with Li Zefan's ftrace_stress_test. Cc: stable@kernel.org Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-12 20:30:21 -05:00
Steven Rostedt	283740c619	tracing: Use same local variable when resetting the ring buffer In the ftrace code that resets the ring buffer it references the buffer with a local variable, but then uses the tr->buffer as the parameter to reset. If the wakeup tracer is running, which can switch the tr->buffer with the max saved buffer, this can break the requirement of disabling the buffer before the reset. buffer = tr->buffer; ring_buffer_record_disable(buffer); synchronize_sched(); __tracing_reset(tr->buffer, cpu); If the tr->buffer is swapped, then the reset is not happening to the buffer that was disabled. This will cause the ring buffer to fail. Found with Li Zefan's ftrace_stress_test. Cc: stable@kernel.org Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-12 20:29:20 -05:00
Steven Rostedt	ea14eb7140	function-graph: Init curr_ret_stack with ret_stack If the graph tracer is active, and a task is forked but the allocating of the processes graph stack fails, it can cause crash later on. This is due to the temporary stack being NULL, but the curr_ret_stack variable is copied from the parent. If it is not -1, then in ftrace_graph_probe_sched_switch() the following: for (index = next->curr_ret_stack; index >= 0; index--) next->ret_stack[index].calltime += timestamp; Will cause a kernel OOPS. Found with Li Zefan's ftrace_stress_test. Cc: stable@kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-12 20:28:02 -05:00
Lai Jiangshan	52fbe9cde7	ring-buffer: Move disabled check into preempt disable section The ring buffer resizing and resetting relies on a schedule RCU action. The buffers are disabled, a synchronize_sched() is called and then the resize or reset takes place. But this only works if the disabling of the buffers are within the preempt disabled section, otherwise a window exists that the buffers can be written to while a reset or resize takes place. Cc: stable@kernel.org Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B949E43.2010906@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-12 20:26:56 -05:00
Linus Torvalds	64d5aea300	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timekeeping: Prevent oops when GENERIC_TIME=n	2010-03-12 16:27:08 -08:00
Linus Torvalds	c32da02342	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits) doc: fix typo in comment explaining rb_tree usage Remove fs/ntfs/ChangeLog doc: fix console doc typo doc: cpuset: Update the cpuset flag file Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed Remove drivers/parport/ChangeLog Remove drivers/char/ChangeLog doc: typo - Table 1-2 should refer to "status", not "statm" tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h devres/irq: Fix devm_irq_match comment Remove reference to kthread_create_on_cpu tree-wide: Assorted spelling fixes tree-wide: fix 'lenght' typo in comments and code drm/kms: fix spelling in error message doc: capitalization and other minor fixes in pnp doc devres: typo fix s/dev/devm/ Remove redundant trailing semicolons from macros fix typo "definetly" -> "definitely" in comment tree-wide: s/widht/width/g typo in comments ... Fix trivial conflict in Documentation/laptops/00-INDEX	2010-03-12 16:04:50 -08:00
Dave Young	2edf5e4980	sysctl extern cleanup: lockdep Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move lockdep extern declarations to linux/lockdep.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:53:10 -08:00
Dave Young	4f0e056fde	sysctl extern cleanup: rtmutex Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move max_lock_depth extern declaration to linux/rtmutex.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:53:10 -08:00
Dave Young	c55b7c3e82	sysctl extern cleanup: acct Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move acct_parm extern declaration to linux/acct.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:53:10 -08:00
Dave Young	15485a4682	sysctl extern cleanup: sg Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move sg_big_buff extern declaration to scsi/sg.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Acked-by: Doug Gilbert <dgilbert@interlog.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:53:10 -08:00
Dave Young	5ed109103d	sysctl extern cleanup: module Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move modprobe_path extern declaration to linux/kmod.h Move modules_disabled extern declaration to linux/module.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:53:10 -08:00
Dave Young	e5ab67726f	sysctl extern cleanup: rcu Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move rcutorture_runnable extern declaration to linux/rcupdate.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Acked-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:53:08 -08:00
Dave Young	d33ed52d57	sysctl extern cleanup: signal Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move print_fatal_signals extern declaration to linux/signal.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:44 -08:00
Dave Young	eb5572fed5	sysctl extern cleanup: C_A_D Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move C_A_D extern variable declaration to linux/reboot.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:44 -08:00
Alexey Dobriyan	8467005da3	nsproxy: remove INIT_NSPROXY() Remove INIT_NSPROXY(), use C99 initializer. Remove INIT_IPC_NS(), INIT_NET_NS() while I'm at it. Note: headers trim will be done later, now it's quite pointless because results will be invalidated by merge window. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:40 -08:00
Oleg Nesterov	13aa9a6b0f	pid_ns: zap_pid_ns_processes: use SEND_SIG_NOINFO instead of force_sig() zap_pid_ns_processes() uses force_sig(SIGKILL) to ensure SIGKILL will be delivered to sub-namespace inits as well. This is correct, but we are going to change force_sig_info() semantics. See http://bugzilla.kernel.org/show_bug.cgi?id=15395#c31 We can use send_sig_info(SEND_SIG_NOINFO) instead, since `614c517d7c` ("signals: SEND_SIG_NOINFO should be considered as SI_FROMUSER()") SEND_SIG_NOINFO means "from user" and therefore send_signal() will get the correct from_ancestor_ns = T flag. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:40 -08:00
Veaceslav Falico	93c59907c6	copy_signal() cleanup: clean thread_group_cputime_init() Remove unneeded initializations in thread_group_cputime_init() and in posix_cpu_timers_init_group(). They are useless after kmem_cache_zalloc() was used in copy_signal(). Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:39 -08:00
Veaceslav Falico	4dd66e69d4	copy_signal() cleanup: kill taskstats_tgid_init() and acct_init_pacct() Kill unused functions taskstats_tgid_init() and acct_init_pacct() because we don't use them anywhere after using kmem_cache_zalloc() in copy_signal(). Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:39 -08:00
Veaceslav Falico	a56704ef6b	copy_signal() cleanup: use zalloc and remove initializations Use kmem_cache_zalloc() on signal creation and remove unneeded initialization lines in copy_signal(). Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:39 -08:00
Kirill A. Shutemov	a0a4db548e	cgroups: remove events before destroying subsystem state objects Events should be removed after rmdir of cgroup directory, but before destroying subsystem state objects. Let's take reference to cgroup directory dentry to do that. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:37 -08:00
Kirill A. Shutemov	4ab78683c1	cgroups: fix race between userspace and kernelspace Notify userspace about cgroup removing only after rmdir of cgroup directory to avoid race between userspace and kernelspace. eventfd are used to notify about two types of event: - control file-specific, like crossing memory threshold; - cgroup removing. To understand what really happen, userspace can check if the cgroup still exists. To avoid race beetween userspace and kernelspace we have to notify userspace about cgroup removing only after rmdir of cgroup directory. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:37 -08:00
Kirill A. Shutemov	0dea116876	cgroup: implement eventfd-based generic API for notifications This patchset introduces eventfd-based API for notifications in cgroups and implements memory notifications on top of it. It uses statistics in memory controler to track memory usage. Output of time(1) on building kernel on tmpfs: Root cgroup before changes: make -j2 506.37 user 60.93s system 193% cpu 4:52.77 total Non-root cgroup before changes: make -j2 507.14 user 62.66s system 193% cpu 4:54.74 total Root cgroup after changes (0 thresholds): make -j2 507.13 user 62.20s system 193% cpu 4:53.55 total Non-root cgroup after changes (0 thresholds): make -j2 507.70 user 64.20s system 193% cpu 4:55.70 total Root cgroup after changes (1 thresholds, never crossed): make -j2 506.97 user 62.20s system 193% cpu 4:53.90 total Non-root cgroup after changes (1 thresholds, never crossed): make -j2 507.55 user 64.08s system 193% cpu 4:55.63 total This patch: Introduce the write-only file "cgroup.event_control" in every cgroup. To register new notification handler you need: - create an eventfd; - open a control file to be monitored. Callbacks register_event() and unregister_event() must be defined for the control file; - write "<event_fd> <control_fd> <args>" to cgroup.event_control. Interpretation of args is defined by control file implementation; eventfd will be woken up by control file implementation or when the cgroup is removed. To unregister notification handler just close eventfd. If you need notification functionality for a control file you have to implement callbacks register_event() and unregister_event() in the struct cftype. [kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix] Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Vladislav Buzov <vbuzov@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Alexander Shishkin <virtuoso@slind.org> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:37 -08:00
Li Zefan	b70cc5fdb4	cgroups: clean up cgroup_pidlist_find() a bit Don't call get_pid_ns() before we locate/alloc the ns. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Serge Hallyn <serue@us.ibm.com> Acked-by: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Ben Blum	67523c48aa	cgroups: blkio subsystem as module Modify the Block I/O cgroup subsystem to be able to be built as a module. As the CFQ disk scheduler optionally depends on blk-cgroup, config options in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are enhanced to support the new module dependency. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Ben Blum	cf5d5941fd	cgroups: subsystem module unloading Provides support for unloading modular subsystems. This patch adds a new function cgroup_unload_subsys which is to be used for removing a loaded subsystem during module deletion. Reference counting of the subsystems' modules is moved from once (at load time) to once per attached hierarchy (in parse_cgroupfs_options and rebind_subsystems) (i.e., 0 or 1). Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Ben Blum	e6a1105ba0	cgroups: subsystem module loading interface Add interface between cgroups subsystem management and module loading This patch implements rudimentary module-loading support for cgroups - namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a module initcall, and a struct module pointer in struct cgroup_subsys. Several functions that might be wanted by modules have had EXPORT_SYMBOL added to them, but it's unclear exactly which functions want it and which won't. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Ben Blum	aae8aab403	cgroups: revamp subsys array This patch series provides the ability for cgroup subsystems to be compiled as modules both within and outside the kernel tree. This is mainly useful for classifiers and subsystems that hook into components that are already modules. cls_cgroup and blkio-cgroup serve as the example use cases for this feature. It provides an interface cgroup_load_subsys() and cgroup_unload_subsys() which modular subsystems can use to register and depart during runtime. The net_cls classifier subsystem serves as the example for a subsystem which can be converted into a module using these changes. Patch #1 sets up the subsys[] array so its contents can be dynamic as modules appear and (eventually) disappear. Iterations over the array are modified to handle when subsystems are absent, and the dynamic section of the array is protected by cgroup_mutex. Patch #2 implements an interface for modules to load subsystems, called cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module pointer in struct cgroup_subsys. Patch #3 adds a mechanism for unloading modular subsystems, which includes a more advanced rework of the rudimentary reference counting introduced in patch 2. Patch #4 modifies the net_cls subsystem, which already had some module declarations, to be configurable as a module, which also serves as a simple proof-of-concept. Part of implementing patches 2 and 4 involved updating css pointers in each css_set when the module appears or leaves. In doing this, it was discovered that css_sets always remain linked to the dummy cgroup, regardless of whether or not any subsystems are actually bound to it (i.e., not mounted on an actual hierarchy). The subsystem loading and unloading code therefore should keep in mind the special cases where the added subsystem is the only one in the dummy cgroup (and therefore all css_sets need to be linked back into it) and where the removed subsys was the only one in the dummy cgroup (and therefore all css_sets should be unlinked from it) - however, as all css_sets always stay attached to the dummy cgroup anyway, these cases are ignored. Any fix that addresses this issue should also make sure these cases are addressed in the subsystem loading and unloading code. This patch: Make subsys[] able to be dynamically populated to support modular subsystems This patch reworks the way the subsys[] array is used so that subsystems can register themselves after boot time, and enables the internals of cgroups to be able to handle when subsystems are not present or may appear/disappear. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Daisuke Nishimura	d7b9fff711	cgroup: introduce coalesce css_get() and css_put() Current css_get() and css_put() increment/decrement css->refcnt one by one. This patch add a new function __css_get(), which takes "count" as a arg and increment the css->refcnt by "count". And this patch also add a new arg("count") to __css_put() and change the function to decrement the css->refcnt by "count". These coalesce version of __css_get()/__css_put() will be used to improve performance of memcg's moving charge feature later, where instead of calling css_get()/css_put() repeatedly, these new functions will be used. No change is needed for current users of css_get()/css_put(). Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Daisuke Nishimura	2468c7234b	cgroup: introduce cancel_attach() Add cancel_attach() operation to struct cgroup_subsys. cancel_attach() can be used when can_attach() operation prepares something for the subsys, but we should rollback what can_attach() operation has prepared if attach task fails after we've succeeded in can_attach(). Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:35 -08:00
Christoph Hellwig	5cacdb4add	Add generic sys_olduname() Add generic implementations of the old and really old uname system calls. Note that sh only implements sys_olduname but not sys_oldolduname, but I'm not going to bother with another ifdef for that special case. m32r implemented an old uname but never wired it up, so kill it, too. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:32 -08:00
Christoph Hellwig	e28cbf2293	improve sys_newuname() for compat architectures On an architecture that supports 32-bit compat we need to override the reported machine in uname with the 32-bit value. Instead of doing this separately in every architecture introduce a COMPAT_UTS_MACHINE define in <asm/compat.h> and apply it directly in sys_newuname(). Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:32 -08:00
Christoph Hellwig	baed7fc9b5	Add generic sys_ipc wrapper Add a generic implementation of the ipc demultiplexer syscall. Except for s390 and sparc64 all implementations of the sys_ipc are nearly identical. There are slight differences in the types of the parameters, where mips and powerpc as the only 64-bit architectures with sys_ipc use unsigned long for the "third" argument as it gets casted to a pointer later, while it traditionally is an "int" like most other paramters. frv goes even further and uses unsigned long for all parameters execept for "ptr" which is a pointer type everywhere. The change from int to unsigned long for "third" and back to "int" for the others on frv should be fine due to the in-register calling conventions for syscalls (we already had a similar issue with the generic sys_ptrace), but I'd prefer to have the arch maintainers looks over this in details. Except for that h8300, m68k and m68knommu lack an impplementation of the semtimedop sub call which this patch adds, and various architectures have gets used - at least on i386 it seems superflous as the compat code on x86-64 and ia64 doesn't even bother to implement it. [akpm@linux-foundation.org: add sys_ipc to sys_ni.c] Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:32 -08:00
Thomas Gleixner	802702e0c2	timer: Try to survive timer callback preempt_count leak If a timer callback leaks preempt_count we currently assert a BUG(). That makes it unnecessarily hard to retrieve information about the problem especially on laptops and headless stations. There is a decent chance to survive the preempt_count leak by restoring the preempt_count to the value before the callback. That allows in many cases to get valuable information about the root cause of the problem. We carried that fixup in preempt-rt for years and were able to decode such wreckage quite a few times. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Veen <arjan@infradead.org>	2010-03-12 22:40:44 +01:00
Thomas Gleixner	576da126a6	timer: Split out timer function call The ident level is starting to be annoying. More white space than actual code. Split out the timer function call into its own function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:43 +01:00
Uwe Kleine-König	06f71b922c	timer: Print function name for timer callbacks modifying preemption count A function scheduled with a timer must not exit with a different preempt count than it was entered. To make helping users running into the corresponding BUG() easier also print the name of the bad function not only its address. [ tglx: Sanitized printk ] Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: johnstul@us.ibm.com Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:42 +01:00
John Stultz	64ce4c2f52	time: Clean up warp_clock() warp_clock() currently accesses timekeeping internal state directly, which is unnecessary. Convert it to use the proper timekeeping interfaces. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:42 +01:00
Stanislaw Gruszka	c28739375b	cpu-timers: Avoid iterating over all threads in fastpath_timer_check() Spread p->sighand->siglock locking scope to make sure that fastpath_timer_check() never iterates over all threads. Without locking there is small possibility that signal->cputimer will stop running while we write values to signal->cputime_expires. Calling thread_group_cputime() from fastpath_timer_check() is not only bad because it is slow, also it is racy with __exit_signal() which can lead to invalid signal->{s,u}time values. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:41 +01:00
Stanislaw Gruszka	1f169f84d2	cpu-timers: Change SIGEV_NONE timer implementation When user sets up a timer without associated signal and process does not use any other cpu timers and does not exit, tsk->signal->cputimer is enabled and running forever. Avoid running the timer for no reason. I used below program to check patch does not break current user space visible behavior. #include <sys/time.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <time.h> #include <unistd.h> #include <assert.h> void consume_cpu(void) { int i = 0; int count = 0; for(i=0; i<100000000; i++) count++; } int main(void) { int i; struct sigaction act; struct sigevent evt = { }; timer_t tid; struct itimerspec spec = { }; evt.sigev_notify = SIGEV_NONE; assert(timer_create(CLOCK_PROCESS_CPUTIME_ID, &evt, &tid) == 0); spec.it_value.tv_sec = 10; assert(timer_settime(tid, 0, &spec, NULL) == 0); for (i = 0; i < 30; i++) { consume_cpu(); memset(&spec, 0, sizeof(spec)); assert(timer_gettime(tid, &spec) == 0); printf("%lu.%09lu\n", (unsigned long) spec.it_value.tv_sec, (unsigned long) spec.it_value.tv_nsec); } assert(timer_delete(tid) == 0); return 0; } Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:40 +01:00
Stanislaw Gruszka	ae1a78eecc	cpu-timers: Return correct previous timer reload value According POSIX we need to correctly set old timer it_interval value when user request that in timer_settime(). Tested using below program. #include <sys/time.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <unistd.h> #include <assert.h> int main(void) { struct sigaction act; struct sigevent evt = { }; timer_t tid; struct itimerspec spec, u_spec, k_spec; evt.sigev_notify = SIGEV_SIGNAL; evt.sigev_signo = SIGPROF; assert(timer_create(CLOCK_PROCESS_CPUTIME_ID, &evt, &tid) == 0); spec.it_value.tv_sec = 1; spec.it_value.tv_nsec = 2; spec.it_interval.tv_sec = 3; spec.it_interval.tv_nsec = 4; u_spec = spec; assert(timer_settime(tid, 0, &spec, NULL) == 0); spec.it_value.tv_sec = 5; spec.it_value.tv_nsec = 6; spec.it_interval.tv_sec = 7; spec.it_interval.tv_nsec = 8; assert(timer_settime(tid, 0, &spec, &k_spec) == 0); #define PRT(val) printf(#val ":\t%d/%d\n", (int) u_spec.val, (int) k_spec.val) PRT(it_value.tv_sec); PRT(it_value.tv_nsec); PRT(it_interval.tv_sec); PRT(it_interval.tv_nsec); return 0; } Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:40 +01:00
Stanislaw Gruszka	5eb9aa6414	cpu-timers: Cleanup arm_timer() Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:39 +01:00
Stanislaw Gruszka	f55db60904	cpu-timers: Simplify RLIMIT_CPU handling Let always set signal->cputime_expires expiration cache when setting new itimer, POSIX 1.b timer, and RLIMIT_CPU. Since we are initializing prof_exp expiration cache during fork(), this allows to remove "RLIMIT_CPU != inf" check from fastpath_timer_check() and do some other cleanups. Checked against regression using test cases from: http://marc.info/?l=linux-kernel&m=123749066504641&w=4 http://marc.info/?l=linux-kernel&m=123811277916642&w=2 Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:39 +01:00
Stanislaw Gruszka	15365c108e	posix-cpu-timers: Reset expire cache when no timer is running When a process deletes cpu timer or a timer expires we do not clear the expiration cache sig->cputimer_expires. As a result the fastpath_timer_check() which prevents us to loop over all threads in case no timer is active is not working and we run the slow path needlessly on every tick. Zero sig->cputimer_expires in stop_process_timers(). Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Spencer Candland <spencer@bluehost.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 19:12:18 +01:00
Andrew Morton	829b6c1ef4	timer stats: Fix del_timer_sync() and try_to_del_timer_sync() These functions forgot to run timer_stats_timer_clear_start_info(). It's unobvious what effect this has and whether it matters much - we won't be printing it out anyway if the timer's detached. Untested, just an Ingo trollpatch. [ Nevertheless correct - tglx ] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 19:11:29 +01:00
Thomas Gleixner	80a05b9ffa	clockevents: Sanitize min_delta_ns adjustment and prevent overflows The current logic which handles clock events programming failures can increase min_delta_ns unlimited and even can cause overflows. Sanitize it by: - prevent zero increase when min_delta_ns == 1 - limiting min_delta_ns to a jiffie - bail out if the jiffie limit is hit - add retries stats for /proc/timer_list so we can gather data Reported-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 19:10:29 +01:00
Ingo Molnar	937779db13	Merge branch 'perf/urgent' into perf/core Merge reason: We want to queue up a dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-12 10:20:59 +01:00
Mike Galbraith	beac4c7e4a	sched: Remove AFFINE_WAKEUPS feature Disabling affine wakeups is too horrible to contemplate. Remove the feature flag. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301890.6785.50.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:53 +01:00
Mike Galbraith	13814d42e4	sched: Remove ASYM_GRAN feature This features has been enabled for quite a while, after testing showed that easing preemption for light tasks was harmful to high priority threads. Remove the feature flag. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301675.6785.44.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:53 +01:00
Mike Galbraith	c6ee36c423	sched: Remove SYNC_WAKEUPS feature Sync wakeups are critical functionality with a long history. Remove it, we don't need the branch or icache footprint. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301817.6785.47.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:53 +01:00
Mike Galbraith	f2e74eeac0	sched: Remove WAKEUP_SYNC feature This feature never earned its keep, remove it. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301591.6785.42.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:52 +01:00
Mike Galbraith	5ca9880c6f	sched: Remove FAIR_SLEEPERS feature Our preemption model relies too heavily on sleeper fairness to disable it without dire consequences. Remove the feature, and save a branch or two. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301520.6785.40.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:52 +01:00
Mike Galbraith	6bc6cf2b61	sched: Remove NORMALIZED_SLEEPER This feature hasn't been enabled in a long time, remove effectively dead code. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301447.6785.38.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:52 +01:00
Mike Galbraith	8b911acdf0	sched: Fix select_idle_sibling() Don't bother with selection when the current cpu is idle. Recent load balancing changes also make it no longer necessary to check wake_affine() success before returning the selected sibling, so we now always use it. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301369.6785.36.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:51 +01:00
Mike Galbraith	21406928af	sched: Tweak sched_latency and min_granularity Allow LAST_BUDDY to kick in sooner, improving cache utilization as soon as a second buddy pair arrives on scene. The cost is latency starting to climb sooner, the tbenefit for tbench 8 on my Q6600 box is ~2%. No detrimental effects noted in normal idesktop usage. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301285.6785.34.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:51 +01:00
Mike Galbraith	a64692a3af	sched: Cleanup/optimize clock updates Now that we no longer depend on the clock being updated prior to enqueueing on migratory wakeup, we can clean up a bit, placing calls to update_rq_clock() exactly where they are needed, ie on enqueue, dequeue and schedule events. In the case of a freshly enqueued task immediately preempting, we can skip the update during preemption, as the clock was just updated by the enqueue event. We also save an unneeded call during a migratory wakeup by not updating the previous runqueue, where update_curr() won't be invoked. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301199.6785.32.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:50 +01:00
Mike Galbraith	e12f31d3e5	sched: Remove avg_overlap Both avg_overlap and avg_wakeup had an inherent problem in that their accuracy was detrimentally affected by cross-cpu wakeups, this because we are missing the necessary call to update_curr(). This can't be fixed without increasing overhead in our already too fat fastpath. Additionally, with recent load balancing changes making us prefer to place tasks in an idle cache domain (which is good for compute bound loads), communicating tasks suffer when a sync wakeup, which would enable affine placement, is turned into a non-sync wakeup by SYNC_LESS. With one task on the runqueue, wake_affine() rejects the affine wakeup request, leaving the unfortunate where placed, taking frequent cache misses. Remove it, and recover some fastpath cycles. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301121.6785.30.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:50 +01:00
Mike Galbraith	b42e0c41a4	sched: Remove avg_wakeup Testing the load which led to this heuristic (nfs4 kbuild) shows that it has outlived it's usefullness. With intervening load balancing changes, I cannot see any difference with/without, so recover there fastpath cycles. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301062.6785.29.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:50 +01:00
Mike Galbraith	39c0cbe215	sched: Rate-limit nohz Entering nohz code on every micro-idle is costing ~10% throughput for netperf TCP_RR when scheduling cross-cpu. Rate limiting entry fixes this, but raises ticks a bit. On my Q6600, an idle box goes from ~85 interrupts/sec to 128. The higher the context switch rate, the more nohz entry costs. With this patch and some cycle recovery patches in my tree, max cross cpu context switch rate is improved by ~16%, a large portion of which of which is this ratelimiting. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301003.6785.28.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:49 +01:00
eranian@google.com	9b33fa6ba0	perf_events: Improve task_sched_in() This patch is an optimization in perf_event_task_sched_in() to avoid scheduling the events twice in a row. Without it, the perf_disable()/perf_enable() pair is invoked twice, thereby pinned events counts while scheduling flexible events and we go throuh hw_perf_enable() twice. By encapsulating, the whole sequence into perf_disable()/perf_enable() we ensure, hw_perf_enable() is going to be invoked only once because of the refcount protection. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268288765-5326-1-git-send-email-eranian@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 15:23:28 +01:00

... 8 9 10 11 12 ...

10270 Commits