linux

Author	SHA1	Message	Date
Peter Zijlstra	34f971f6f7	sched: Create special class for stop/migrate work In order to separate the stop/migrate work thread from the SCHED_FIFO implementation, create a special class for it that is of higher priority than SCHED_FIFO itself. This currently solves a problem where cpu-hotplug consumes so much cpu-time that the SCHED_FIFO class gets throttled, but has the bandwidth replenishment timer pending on the now dead cpu. It is also required for when we add the planned deadline scheduling class above SCHED_FIFO, as the stop/migrate thread still needs to transcent those tasks. Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1285165776.2275.1022.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 18:41:58 +02:00
Peter Zijlstra	4924627423	sched: Unindent labels Labels should be on column 0. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-18 18:41:56 +02:00
Steven Rostedt	78c89ba121	tracing: Remove parent recording in latency tracer graph options Even though the parent is recorded with the normal function tracing of the latency tracers (irqsoff and wakeup), the function graph recording is bogus. This is due to the function graph messing with the return stack. The latency tracers pass in as the parent CALLER_ADDR0, which works fine for plain function tracing. But this causes bogus output with the graph tracer: 3) <idle>-0 \| d.s3. 0.000 us \| return_to_handler(); 3) <idle>-0 \| d.s3. 0.000 us \| _raw_spin_unlock_irqrestore(); 3) <idle>-0 \| d.s3. 0.000 us \| return_to_handler(); 3) <idle>-0 \| d.s3. 0.000 us \| trace_hardirqs_on(); The "return_to_handle()" call is the trampoline of the function graph tracer, and is meaningless in this context. Cc: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-10-18 10:53:38 -04:00
Steven Rostedt	5e6d2b9cfa	tracing: Use one prologue for the preempt irqs off tracer function tracers The preempt and irqsoff tracers have three types of function tracers. Normal function tracer, function graph entry, and function graph return. Each of these use a complex dance to prevent recursion and whether to trace the data or not (depending if interrupts are enabled or not). This patch moves the duplicate code into a single routine, to prevent future mistakes with modifying duplicate complex code. Cc: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-10-18 10:53:36 -04:00
Steven Rostedt	542181d376	tracing: Use one prologue for the wakeup tracer function tracers The wakeup tracer has three types of function tracers. Normal function tracer, function graph entry, and function graph return. Each of these use a complex dance to prevent recursion and whether to trace the data or not (depending on the wake_task variable). This patch moves the duplicate code into a single routine, to prevent future mistakes with modifying duplicate complex code. Cc: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-10-18 10:53:33 -04:00
Jiri Olsa	7495a5beaa	tracing: Graph support for wakeup tracer Add function graph support for wakeup latency tracer. The graph output is enabled by setting the 'display-graph' trace option. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1285243253-7372-4-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-10-18 10:53:30 -04:00
Jiri Olsa	0a772620a2	tracing: Make graph related irqs/preemptsoff functions global Move trace_graph_function() and print_graph_headers_flags() functions to the trace_function_graph.c to be globaly available. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1285243253-7372-3-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-10-18 10:53:28 -04:00
Jiri Olsa	a9d61173dc	tracing: Add proper check for irq_depth routines The check_irq_entry and check_irq_return could be called from graph event context. In such case there's no graph private data allocated. Adding checks to handle this case. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <20100924154102.GB1818@jolsa.brq.redhat.com> [ Fixed some grammar in the comments ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-10-18 10:53:25 -04:00
matt mooney	907f278409	tracing/trivial: Remove cast from void* Unnecessary cast from void* in assignment. Signed-off-by: matt mooney <mfm@muteddisk.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-10-18 10:53:22 -04:00
Nishanth Menon	e1f60b292f	PM: Introduce library for device-specific OPPs (v7) SoCs have a standard set of tuples consisting of frequency and voltage pairs that the device will support per voltage domain. These are called Operating Performance Points or OPPs. The actual definitions of OPP varies over silicon versions. For a specific domain, we can have a set of {frequency, voltage} pairs. As the kernel boots and more information is available, a default set of these are activated based on the precise nature of device. Further on operation, based on conditions prevailing in the system (such as temperature), some OPP availability may be temporarily controlled by the SoC frameworks. To implement an OPP, some sort of power management support is necessary hence this library depends on CONFIG_PM. Contributions include: Sanjeev Premi for the initial concept: http://patchwork.kernel.org/patch/50998/ Kevin Hilman for converting original design to device-based. Kevin Hilman and Paul Walmsey for cleaning up many of the function abstractions, improvements and data structure handling. Romit Dasgupta for using enums instead of opp pointers. Thara Gopinath, Eduardo Valentin and Vishwanath BS for fixes and cleanups. Linus Walleij for recommending this layer be made generic for usage in other architectures beyond OMAP and ARM. Mark Brown, Andrew Morton, Rafael J. Wysocki, Paul E. McKenney for valuable improvements. Discussions and comments from: http://marc.info/?l=linux-omap&m=126033945313269&w=2 http://marc.info/?l=linux-omap&m=125482970102327&w=2 http://marc.info/?t=125809247500002&r=1&w=2 http://marc.info/?l=linux-omap&m=126025973426007&w=2 http://marc.info/?t=128152609200064&r=1&w=2 http://marc.info/?t=128468723000002&r=1&w=2 incorporated. v1: http://marc.info/?t=128468723000002&r=1&w=2 Signed-off-by: Nishanth Menon <nm@ti.com> Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-10-17 01:57:50 +02:00
James Hogan	d33ac60bea	PM: Add sysfs attr for rechecking dev hash from PM trace If the device which fails to resume is part of a loadable kernel module it won't be checked at startup against the magic number stored in the RTC. Add a read-only sysfs attribute /sys/power/pm_trace_dev_match which contains a list of newline separated devices (usually just the one) which currently match the last magic number. This allows the device which is failing to resume to be found after the modules are loaded again. Signed-off-by: James Hogan <james@albanarts.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-10-17 01:57:50 +02:00
Rafael J. Wysocki	3624eb04c2	PM / Hibernate: Modify signature used to mark swap Since we are adding compression to the kernel's hibernate code, change signature used by it to mark swap spaces, so that earlier kernels don't attempt to restore compressed images they cannot handle. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz>	2010-10-17 01:57:49 +02:00
Rafael J. Wysocki	dbeeec5fe8	PM: Allow wakeup events to abort freezing of tasks If there is a wakeup event during the freezing of tasks, suspend or hibernation will fail anyway. Since try_to_freeze_tasks() can take up to 20 seconds to complete or fail, aborting it as soon as a wakeup event is detected improves the worst case wakeup latency. Based on a patch from Arve Hjønnevåg. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz>	2010-10-17 01:57:49 +02:00
Rafael J. Wysocki	d0941ead3f	PM / Hibernate: Make some boot messages look less scary The hibernate resume code checks if there is an image to resume from on every boot and, if the kernel is built with CONFIG_PM_DEBUG set and the image is not present, it prints some scary messages suggesting there was a boot error of some sort. Apparently, some users are confused by them, so make them look less scary and adjust the other hibernate resume debug messages to match them. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-10-17 01:57:48 +02:00
Rafael J. Wysocki	074037ec79	PM / Wakeup: Introduce wakeup source objects and event statistics (v3) Introduce struct wakeup_source for representing system wakeup sources within the kernel and for collecting statistics related to them. Make the recently introduced helper functions pm_wakeup_event(), pm_stay_awake() and pm_relax() use struct wakeup_source objects internally, so that wakeup statistics associated with wakeup devices can be collected and reported in a consistent way (the definition of pm_relax() is changed, which is harmless, because this function is not called directly by anyone yet). Introduce new wakeup-related sysfs device attributes in /sys/devices/.../power for reporting the device wakeup statistics. Change the global wakeup events counters event_count and events_in_progress into atomic variables, so that it is not necessary to acquire a global spinlock in pm_wakeup_event(), pm_stay_awake() and pm_relax(), which should allow us to avoid lock contention in these functions on SMP systems with many wakeup devices. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-10-17 01:57:43 +02:00
Rafael J. Wysocki	ac5c24ec1e	PM / Hibernate: Make default image size depend on total RAM size The default hibernation image size is currently hard coded and euqal to 500 MB, which is not a reasonable default on many contemporary systems. Make it equal 2/5 of the total RAM size (this is slightly below the maximum, i.e. 1/2 of the total RAM size, and seems to be generally suitable). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Tested-by: M. Vefa Bicakci <bicave@superonline.com>	2010-10-17 01:57:43 +02:00
Rafael J. Wysocki	266f1a25ef	PM / Hibernate: Improve comments in hibernate_preallocate_memory() One comment in hibernate_preallocate_memory() is wrong, so fix it and add one more comment to clarify the meaning of the fixed one. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-10-17 01:57:42 +02:00
Rafael J. Wysocki	bcb5ba8b4e	PM / Runtime: Use alloc_workqueue() for creating the PM workqueue Although we need the PM workqueue to be freezable, we don't need it to be singlethread. Also, the number of concurrent work items running on a single CPU need not be constrained. For these reasons use alloc_workqueue() directly, with suitable arguments, instead of create_freezeable_workqueue(), to create the runtime PM workqueue. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Tejun Heo <tj@kernel.org>	2010-10-17 01:57:42 +02:00
Rafael J. Wysocki	ede890c2c0	PM: Fix unmet dependency warning from kconfig Fix the following build warning: warning: (PM_SLEEP_SMP && SMP && (ARCH_SUSPEND_POSSIBLE \|\| \ ARCH_HIBERNATION_POSSIBLE) && PM_SLEEP) selects HOTPLUG_CPU which \ has unmet direct dependencies (SMP && HOTPLUG) by selecting HOTPLUG along with CPU_HOTPLUG. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Randy Dunlap <randy.dunlap@oracle.com>	2010-10-17 01:57:42 +02:00
Bojan Smojver	f996fc9671	PM / Hibernate: Compress hibernation image with LZO Compress hibernation image with LZO in order to save on I/O and therefore time to hibernate/thaw. [rjw: Added hibernate=nocompress command line option instead of just nocompress which would be confusing, fixed a couple of compiler warnings, fixed kerneldoc comments, minor cleanups.] Signed-off-by: Bojan Smojver <bojan@rexursive.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-10-17 01:57:42 +02:00
Eric Dumazet	a9febbb4bd	sysctl: min/max bounds are optional sysctl check complains with a WARN() when proc_doulongvec_minmax() or proc_doulongvec_ms_jiffies_minmax() are used by a vector of longs (with more than one element), with no min or max value specified. This is unexpected, given we had a bug on this min/max handling :) Reported-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: David Miller <davem@davemloft.net> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-10-15 14:42:24 -07:00
Arnd Bergmann	6038f373a3	llseek: automatically add .llseek fop All file_operations should get a .llseek operation so we can make nonseekable_open the default for future file operations without a .llseek pointer. The three cases that we can automatically detect are no_llseek, seq_lseek and default_llseek. For cases where we can we can automatically prove that the file offset is always ignored, we use noop_llseek, which maintains the current behavior of not returning an error from a seek. New drivers should normally not use noop_llseek but instead use no_llseek and call nonseekable_open at open time. Existing drivers can be converted to do the same when the maintainer knows for certain that no user code relies on calling seek on the device file. The generated code is often incorrectly indented and right now contains comments that clarify for each added line why a specific variant was chosen. In the version that gets submitted upstream, the comments will be gone and I will manually fix the indentation, because there does not seem to be a way to do that using coccinelle. Some amount of new code is currently sitting in linux-next that should get the same modifications, which I will do at the end of the merge window. Many thanks to Julia Lawall for helping me learn to write a semantic patch that does all this. ===== begin semantic patch ===== // This adds an llseek= method to all file operations, // as a preparation for making no_llseek the default. // // The rules are // - use no_llseek explicitly if we do nonseekable_open // - use seq_lseek for sequential files // - use default_llseek if we know we access f_pos // - use noop_llseek if we know we don't access f_pos, // but we still want to allow users to call lseek // @ open1 exists @ identifier nested_open; @@ nested_open(...) { <+... nonseekable_open(...) ...+> } @ open exists@ identifier open_f; identifier i, f; identifier open1.nested_open; @@ int open_f(struct inode i, struct file f) { <+... ( nonseekable_open(...) \| nested_open(...) ) ...+> } @ read disable optional_qualifier exists @ identifier read_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; expression E; identifier func; @@ ssize_t read_f(struct file f, char p, size_t s, loff_t off) { <+... ( off = E \| off += E \| func(..., off, ...) \| E = off ) ...+> } @ read_no_fpos disable optional_qualifier exists @ identifier read_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; @@ ssize_t read_f(struct file f, char p, size_t s, loff_t off) { ... when != off } @ write @ identifier write_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; expression E; identifier func; @@ ssize_t write_f(struct file f, const char p, size_t s, loff_t off) { <+... ( off = E \| off += E \| func(..., off, ...) \| E = off ) ...+> } @ write_no_fpos @ identifier write_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; @@ ssize_t write_f(struct file f, const char p, size_t s, loff_t off) { ... when != off } @ fops0 @ identifier fops; @@ struct file_operations fops = { ... }; @ has_llseek depends on fops0 @ identifier fops0.fops; identifier llseek_f; @@ struct file_operations fops = { ... .llseek = llseek_f, ... }; @ has_read depends on fops0 @ identifier fops0.fops; identifier read_f; @@ struct file_operations fops = { ... .read = read_f, ... }; @ has_write depends on fops0 @ identifier fops0.fops; identifier write_f; @@ struct file_operations fops = { ... .write = write_f, ... }; @ has_open depends on fops0 @ identifier fops0.fops; identifier open_f; @@ struct file_operations fops = { ... .open = open_f, ... }; // use no_llseek if we call nonseekable_open //////////////////////////////////////////// @ nonseekable1 depends on !has_llseek && has_open @ identifier fops0.fops; identifier nso ~= "nonseekable_open"; @@ struct file_operations fops = { ... .open = nso, ... +.llseek = no_llseek, /* nonseekable / }; @ nonseekable2 depends on !has_llseek @ identifier fops0.fops; identifier open.open_f; @@ struct file_operations fops = { ... .open = open_f, ... +.llseek = no_llseek, / open uses nonseekable / }; // use seq_lseek for sequential files ///////////////////////////////////// @ seq depends on !has_llseek @ identifier fops0.fops; identifier sr ~= "seq_read"; @@ struct file_operations fops = { ... .read = sr, ... +.llseek = seq_lseek, / we have seq_read / }; // use default_llseek if there is a readdir /////////////////////////////////////////// @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier readdir_e; @@ // any other fop is used that changes pos struct file_operations fops = { ... .readdir = readdir_e, ... +.llseek = default_llseek, / readdir is present / }; // use default_llseek if at least one of read/write touches f_pos ///////////////////////////////////////////////////////////////// @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read.read_f; @@ // read fops use offset struct file_operations fops = { ... .read = read_f, ... +.llseek = default_llseek, / read accesses f_pos / }; @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier write.write_f; @@ // write fops use offset struct file_operations fops = { ... .write = write_f, ... + .llseek = default_llseek, / write accesses f_pos / }; // Use noop_llseek if neither read nor write accesses f_pos /////////////////////////////////////////////////////////// @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read_no_fpos.read_f; identifier write_no_fpos.write_f; @@ // write fops use offset struct file_operations fops = { ... .write = write_f, .read = read_f, ... +.llseek = noop_llseek, / read and write both use no f_pos / }; @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier write_no_fpos.write_f; @@ struct file_operations fops = { ... .write = write_f, ... +.llseek = noop_llseek, / write uses no f_pos / }; @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read_no_fpos.read_f; @@ struct file_operations fops = { ... .read = read_f, ... +.llseek = noop_llseek, / read uses no f_pos / }; @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; @@ struct file_operations fops = { ... +.llseek = noop_llseek, / no read or write fn */ }; ===== End semantic patch ===== Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Julia Lawall <julia@diku.dk> Cc: Christoph Hellwig <hch@infradead.org>	2010-10-15 15:53:27 +02:00
Robert Richter	6268464b37	Merge remote branch 'tip/perf/core' into oprofile/core Conflicts: arch/arm/oprofile/common.c kernel/perf_event.c	2010-10-15 12:45:00 +02:00
Ingo Molnar	0fdf13606b	Merge branch 'tip/perf/recordmcount-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2010-10-15 06:12:28 +02:00
Steven Rostedt	cf4db2597a	ftrace: Rename config option HAVE_C_MCOUNT_RECORD to HAVE_C_RECORDMCOUNT The config option used by archs to let the build system know that the C version of the recordmcount works for said arch is currently called HAVE_C_MCOUNT_RECORD which enables BUILD_C_RECORDMCOUNT. To be more consistent with the name that all archs may use, it has been renamed to HAVE_C_RECORDMCOUNT. This will be less confusing since we are building a C recordmcount and not a mcount_record. Suggested-by: Ingo Molnar <mingo@elte.hu> Cc: <linux-arch@vger.kernel.org> Cc: Michal Marek <mmarek@suse.cz> Cc: linux-kbuild@vger.kernel.org Cc: John Reiser <jreiser@bitwagon.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-10-14 23:32:44 -04:00
Ingo Molnar	d9d572a9c0	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core	2010-10-15 05:12:45 +02:00
Steven Rostedt	72441cb1fd	ftrace/x86: Add support for C version of recordmcount This patch adds the support for the C version of recordmcount and compile times show ~ 12% improvement. After verifying this works, other archs can add: HAVE_C_MCOUNT_RECORD in its Kconfig and it will use the C version of recordmcount instead of the perl version. Cc: <linux-arch@vger.kernel.org> Cc: Michal Marek <mmarek@suse.cz> Cc: linux-kbuild@vger.kernel.org Cc: John Reiser <jreiser@bitwagon.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-10-14 16:52:41 -04:00
Salman Qazi	f13d4f979c	hrtimer: Preserve timer state in remove_hrtimer() The race is described as follows: CPU X CPU Y remove_hrtimer // state & QUEUED == 0 timer->state = CALLBACK unlock timer base timer->f(n) //very long hrtimer_start lock timer base remove_hrtimer // no effect hrtimer_enqueue timer->state = CALLBACK \| QUEUED unlock timer base hrtimer_start lock timer base remove_hrtimer mode = INACTIVE // CALLBACK bit lost! switch_hrtimer_base CALLBACK bit not set: timer->base changes to a different CPU. lock this CPU's timer base The bug was introduced with commit `ca109491f` (hrtimer: removing all ur callback modes) in 2.6.29 [ tglx: Feed new state via local variable and add a comment. ] Signed-off-by: Salman Qazi <sqazi@google.com> Cc: akpm@linux-foundation.org Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20101012142351.8485.21823.stgit@dungbeetle.mtv.corp.google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org	2010-10-14 13:29:59 +02:00
Takuya Yoshikawa	864616ee67	sched: Comment updates: fix default latency and granularity numbers Targeted preemption latency and minimal preemption granularity for CPU-bound tasks have been changed. This patch updates the comments about these values. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20101014160913.eb24fef4.yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-14 09:13:35 +02:00
Ingo Molnar	ed859ed3b0	Merge branch 'linus' into sched/core Merge reason: update from -rc5 to -almost-final Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-14 09:11:46 +02:00
Randy Dunlap	fb62db2ba9	futex: Fix kernel-doc notation & typos Convert futex_requeue() function parameters to use @name kernel-doc notation and add @fshared & @cmpval to prevent kernel-doc warnings. Add @list to struct futex_q. Fix a few typos. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20101013110234.89b06043.randy.dunlap@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-14 08:57:35 +02:00
Masami Hiramatsu	fd02e6f7ae	kprobes: Fix selftest to clear flags field for reusing probes Fix selftest to clear flags field for reusing probes because the flags field can be modified by Kprobes. This also set NULL to kprobe.addr instead of 0. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: 2nddept-manager@sdl.hitachi.co.jp LKML-Reference: <20101014031024.4100.50107.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-14 08:55:27 +02:00
Borislav Petkov	14cae9bd2f	tracing: Fix function-graph build warning on 32-bit Fix kernel/trace/trace_functions_graph.c: In function ‘trace_print_graph_duration’: kernel/trace/trace_functions_graph.c:652: warning: comparison of distinct pointer types lacks a cast when building 36-rc6 on a 32-bit due to the strict type check failing in the min() macro. Signed-off-by: Borislav Petkov <bp@alien8.de> Cc: Chase Douglas <chase.douglas@canonical.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> LKML-Reference: <20100929080823.GA13595@liondog.tnic> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-10-13 17:47:53 +02:00
Benjamin Herrenschmidt	4783f393de	Merge remote branch 'kumar/merge' into next	2010-10-13 16:18:36 +11:00
Thomas Gleixner	c0a19ebc01	genirq: Fix CONFIG_GENIRQ_NO_DEPRECATED=y build This option can be set to verify the full conversion to the new chip functions. Fix the fallout of the patch rework, so the core code compiles and works with it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-10-12 21:59:55 +02:00
Steven Rostedt	d01343244a	ring-buffer: Fix typo of time extends per page Time stamps for the ring buffer are created by the difference between two events. Each page of the ring buffer holds a full 64 bit timestamp. Each event has a 27 bit delta stamp from the last event. The unit of time is nanoseconds, so 27 bits can hold ~134 milliseconds. If two events happen more than 134 milliseconds apart, a time extend is inserted to add more bits for the delta. The time extend has 59 bits, which is good for ~18 years. Currently the time extend is committed separately from the event. If an event is discarded before it is committed, due to filtering, the time extend still exists. If all events are being filtered, then after ~134 milliseconds a new time extend will be added to the buffer. This can only happen till the end of the page. Since each page holds a full timestamp, there is no reason to add a time extend to the beginning of a page. Time extends can only fill a page that has actual data at the beginning, so there is no fear that time extends will fill more than a page without any data. When reading an event, a loop is made to skip over time extends since they are only used to maintain the time stamp and are never given to the caller. As a paranoid check to prevent the loop running forever, with the knowledge that time extends may only fill a page, a check is made that tests the iteration of the loop, and if the iteration is more than the number of time extends that can fit in a page a warning is printed and the ring buffer is disabled (all of ftrace is also disabled with it). There is another event type that is called a TIMESTAMP which can hold 64 bits of data in the theoretical case that two events happen 18 years apart. This code has not been implemented, but the name of this event exists, as well as the structure for it. The size of a TIMESTAMP is 16 bytes, where as a time extend is only 8 bytes. The macro used to calculate how many time extends can fit on a page used the TIMESTAMP size instead of the time extend size cutting the amount in half. The following test case can easily trigger the warning since we only need to have half the page filled with time extends to trigger the warning: # cd /sys/kernel/debug/tracing/ # echo function > current_tracer # echo 'common_pid < 0' > events/ftrace/function/filter # echo > trace # echo 1 > trace_marker # sleep 120 # cat trace Enabling the function tracer and then setting the filter to only trace functions where the process id is negative (no events), then clearing the trace buffer to ensure that we have nothing in the buffer, then write to trace_marker to add an event to the beginning of a page, sleep for 2 minutes (only 35 seconds is probably needed, but this guarantees the bug), and then finally reading the trace which will trigger the bug. This patch fixes the typo and prevents the false positive of that warning. Reported-by: Hans J. Koch <hjk@linutronix.de> Tested-by: Hans J. Koch <hjk@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stable Kernel <stable@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-10-12 12:06:43 -04:00
Thomas Gleixner	5b8c4f23c5	printk: Make console_sem a semaphore not a pseudo mutex It needs to be investigated whether it can be replaced by a real mutex, but that needs more thought. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> LKML-Reference: <20100907125057.179587334@linutronix.de>	2010-10-12 17:36:10 +02:00
Thomas Gleixner	37eca0d64a	Merge branch 'linus' into core/locking Reason: Pull in the semaphore related changes Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-10-12 17:27:28 +02:00
Thomas Gleixner	baa0d233af	genirq: Switch sparse_irq allocator to GFP_KERNEL The allocator functions are now called outside of preempt disabled regions. Switch to GFP_KERNEL. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:53:45 +02:00
Thomas Gleixner	a05a900a51	genirq: Make sparse_lock a mutex No callers from atomic regions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:53:45 +02:00
Thomas Gleixner	78f90d91f3	genirq: Remove the now unused sparse irq leftovers The move_irq_desc() function was only used due to the problem that the allocator did not free the old descriptors. So the descriptors had to be moved in create_irq_nr(). That's history. The code would have never been able to move active interrupt descriptors on affinity settings. That can be done in a completely different way w/o all this horror. Remove all of it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:53:44 +02:00
Thomas Gleixner	b7b29338dc	genirq: Sanitize dynamic irq handling Use the cleanup functions of the dynamic allocator. No need to have separate implementations. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:53:44 +02:00
Thomas Gleixner	b7d0d8258a	genirq: Remove arch_init_chip_data() This function should have not been there in the first place. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:53:44 +02:00
Thomas Gleixner	7c5f13519a	Merge branch 'x86/urgent' of into irq/sparseirq Reason: Pull in the latest io_apic bugfixes Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-10-12 16:41:26 +02:00
Thomas Gleixner	b683de2b3c	genirq: Query arch for number of early descriptors sparse irq sets up NR_IRQS_LEGACY irq descriptors and archs then go ahead and allocate more. Use the unused return value of arch_probe_nr_irqs() to let the architecture return the number of early allocations. Fix up all users. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:39:08 +02:00
Thomas Gleixner	aa99ec0f3f	genirq: Use sane sparse allocator Make irq_to_desc_alloc_node() a wrapper around the new allocator. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:39:07 +02:00
Thomas Gleixner	06f6c3399e	genirq: Implement irq reservation Mark a range of interrupts as allocated. In the SPARSE_IRQ=n case we need this to update the bitmap for the legacy irqs so the enumerator via irq_get_next_irq() works. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-10-12 16:39:07 +02:00
Thomas Gleixner	a98d24b71b	genirq: Implement sane enumeration Use the allocator bitmap to lookup active interrupts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:39:07 +02:00
Thomas Gleixner	13bfe99e09	genirq: Prepare proc for real sparse irq support /proc/irq never removes any entries, but when irq descriptors can be freed for real this is necessary. Otherwise we'd reference a freed descriptor in /proc/irq/N Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:39:07 +02:00
Thomas Gleixner	1f5a5b87f7	genirq: Implement a sane sparse_irq allocator The current sparse_irq allocator has several short comings due to failures in the design or the lack of it: - Requires iteration over the number of active irqs to find a free slot (Some architectures have grown their own workarounds for this) - Removal of entries is not possible - Racy between create_irq_nr and destroy_irq (plugged by horrible callbacks) - Migration of active irq descriptors is not possible - No bulk allocation of irq ranges - Sprinkeled irq_desc references all over the place outside of kernel/irq/ (The previous chip functions series is addressing this issue) Implement a sane allocator which fixes the above short comings (though migration of active descriptors needs a full tree wide cleanup of the direct and mostly unlocked access to irq_desc). The new allocator still uses a radix_tree, but uses a bitmap for keeping track of allocated irq numbers. That allows: - Fast lookup of a free slot - Allows the removal of descriptors - Prevents the create/destroy race - Bulk allocation of consecutive irq ranges - Basic design is ready for migration of life descriptors after further cleanups The bitmap is also used in the SPARSE_IRQ=n case for lookup and raceless (de)allocation of irq numbers. So it removes the requirement for looping through the descriptor array to find slots. Right now it uses sparse_irq_lock to protect the bitmap and the radix tree, but after cleaning up all users we should be able convert that to a mutex and to switch the radix_tree and decriptor allocations to GFP_KERNEL. [ Folded in a bugfix from Yinghai Lu ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:39:07 +02:00
Thomas Gleixner	1318a481fc	genirq: Provide default irq init flags Arch code sets it's own irq_desc.status flags right after boot and for dynamically allocated interrupts. That might involve iterating over a huge array. Allow ARCH_IRQ_INIT_FLAGS to set separate flags aside of IRQ_DISABLED which is the default. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:39:06 +02:00
Thomas Gleixner	d895f51ebb	genirq: Remove export of kstat_irqs_cpu The statistics accessor is only used by proc/stats and show_interrupts(). Both are compiled in. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:39:06 +02:00
Thomas Gleixner	154cd387cd	genirq: Remove early_init_irq_lock_class() early_init_irq_lock_class() is called way before anything touches the irq descriptors. In case of SPARSE_IRQ=y this is a NOP operation because the radix tree is empty at this point. For the SPARSE_IRQ=n case it's sufficient to set the lock class in early_init_irq(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:39:06 +02:00
Thomas Gleixner	3795de236d	genirq: Distangle kernel/irq/handle.c kernel/irq/handle.c has become a dumpground for random code in random order. Split out the irq descriptor management and the dummy irq_chip implementation into separate files. Cleanup the include maze while at it. No code change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:39:05 +02:00
Thomas Gleixner	f303a6dd12	genirq: Sanitize irq_data accessors Get the data structure from the core and provide inline wrappers to access the irq_data members. Provide accessor inlines for irq_data as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:39:05 +02:00
Thomas Gleixner	442471848f	genirq: Provide status modifier Provide a irq_desc.status modifier function to cleanup the direct access to irq_desc in arch and driver code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:39:05 +02:00
Thomas Gleixner	e144710b30	genirq: Distangle irq.h Move irq_desc and internal functions out of irq.h Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 16:39:04 +02:00
John Blackwood	ad0cf3478d	perf: Fix incorrect copy_from_user() usage perf events: repair incorrect use of copy_from_user This makes the perf_event_period() return 0 instead of -EFAULT on success. Signed-off-by: John Blackwood<john.blackwood@ccur.com> Signed-off-by: Joe Korty <joe.korty@ccur.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100928220311.GA18145@tsunami.ccur.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-12 11:45:01 +02:00
H. Peter Anvin	8e4029ee35	Merge branch 'x86/urgent' into core/memblock Reason for merge: Forward-port urgent change to arch/x86/mm/srat_64.c to the memblock tree. Resolved Conflicts: arch/x86/mm/srat_64.c Originally-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2010-10-11 17:05:11 -07:00
Robert Richter	ad0f7cfaa8	Merge branch 'oprofile/urgent' (early part) into oprofile/perf	2010-10-11 19:26:50 +02:00
Matt Fleming	84c7991059	perf: New helper function for pmu name Introduce perf_pmu_name() helper function that returns the name of the pmu. This gives us a generic way to get the name of a pmu regardless of how an architecture identifies it internally. Signed-off-by: Matt Fleming <matt@console-pimps.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Robert Richter <robert.richter@amd.com>	2010-10-11 17:45:49 +02:00
Tejun Heo	6370a6ad3b	workqueue: add and use WQ_MEM_RECLAIM flag Add WQ_MEM_RECLAIM flag which currently maps to WQ_RESCUER, mark WQ_RESCUER as internal and replace all external WQ_RESCUER usages to WQ_MEM_RECLAIM. This makes the API users express the intent of the workqueue instead of indicating the internal mechanism used to guarantee forward progress. This is also to make it cleaner to add more semantics to WQ_MEM_RECLAIM. For example, if deemed necessary, memory reclaim workqueues can be made highpri. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jeff Garzik <jgarzik@pobox.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Steven Whitehouse <swhiteho@redhat.com>	2010-10-11 15:20:26 +02:00
Tejun Heo	30310045dd	workqueue: fix HIGHPRI handling in keep_working() The policy function keep_working() didn't check GCWQ_HIGHPRI_PENDING and could return %false with highpri work pending. This could lead to late execution of a highpri work which was delayed due to @max_active throttling if other works are actively consuming CPU cycles. For example, the following could happen. 1. Work W0 which burns CPU cycles. 2. Two works W1 and W2 are queued to a highpri wq w/ @max_active of 1. 3. W1 starts executing and W2 is put to delayed queue. W0 and W1 are both runnable. 4. W1 finishes which puts W2 to pending queue but keep_working() incorrectly returns %false and the worker goes to sleep. 5. W0 finishes and W2 starts execution. With this patch applied, W2 starts execution as soon as W1 finishes. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-10-11 12:09:30 +02:00
Ingo Molnar	7cd2541cf2	Merge commit 'v2.6.36-rc7' into perf/core Conflicts: arch/x86/kernel/module.c Merge reason: Resolve the conflict, pick up fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-08 10:46:27 +02:00
Ingo Molnar	153db80f8c	Merge commit 'v2.6.36-rc7' into core/memblock Merge reason: Update from -rc3 to -rc7. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-08 09:15:00 +02:00
Linus Torvalds	6b0cd00bc3	Merge branch 'hwpoison-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: HWPOISON: Stop shrinking at right page count HWPOISON: Report correct address granuality for AO huge page errors HWPOISON: Copy si_addr_lsb to user page-types.c: fix name of unpoison interface	2010-10-07 13:59:32 -07:00
Eric Dumazet	27b3d80a7b	sysctl: fix min/max handling in __do_proc_doulongvec_minmax() When proc_doulongvec_minmax() is used with an array of longs, and no min/max check requested (.extra1 or .extra2 being NULL), we dereference a NULL pointer for the second element of the array. Noticed while doing some changes in network stack for the "16TB problem" Fix is to not change min & max pointers in __do_proc_doulongvec_minmax(), so that all elements of the vector share an unique min/max limit, like proc_dointvec_minmax(). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-10-07 13:31:21 -07:00
Peter Zijlstra	6506cf6ce6	sched: fix RCU lockdep splat from task_group() This addresses the following RCU lockdep splat: [0.051203] CPU0: AMD QEMU Virtual CPU version 0.12.4 stepping 03 [0.052999] lockdep: fixing up alternatives. [0.054105] [0.054106] =================================================== [0.054999] [ INFO: suspicious rcu_dereference_check() usage. ] [0.054999] --------------------------------------------------- [0.054999] kernel/sched.c:616 invoked rcu_dereference_check() without protection! [0.054999] [0.054999] other info that might help us debug this: [0.054999] [0.054999] [0.054999] rcu_scheduler_active = 1, debug_locks = 1 [0.054999] 3 locks held by swapper/1: [0.054999] #0: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff814be933>] cpu_up+0x42/0x6a [0.054999] #1: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff810400d8>] cpu_hotplug_begin+0x2a/0x51 [0.054999] #2: (&rq->lock){-.-...}, at: [<ffffffff814be2f7>] init_idle+0x2f/0x113 [0.054999] [0.054999] stack backtrace: [0.054999] Pid: 1, comm: swapper Not tainted 2.6.35 #1 [0.054999] Call Trace: [0.054999] [<ffffffff81068054>] lockdep_rcu_dereference+0x9b/0xa3 [0.054999] [<ffffffff810325c3>] task_group+0x7b/0x8a [0.054999] [<ffffffff810325e5>] set_task_rq+0x13/0x40 [0.054999] [<ffffffff814be39a>] init_idle+0xd2/0x113 [0.054999] [<ffffffff814be78a>] fork_idle+0xb8/0xc7 [0.054999] [<ffffffff81068717>] ? mark_held_locks+0x4d/0x6b [0.054999] [<ffffffff814bcebd>] do_fork_idle+0x17/0x2b [0.054999] [<ffffffff814bc89b>] native_cpu_up+0x1c1/0x724 [0.054999] [<ffffffff814bcea6>] ? do_fork_idle+0x0/0x2b [0.054999] [<ffffffff814be876>] _cpu_up+0xac/0x127 [0.054999] [<ffffffff814be946>] cpu_up+0x55/0x6a [0.054999] [<ffffffff81ab562a>] kernel_init+0xe1/0x1ff [0.054999] [<ffffffff81003854>] kernel_thread_helper+0x4/0x10 [0.054999] [<ffffffff814c353c>] ? restore_args+0x0/0x30 [0.054999] [<ffffffff81ab5549>] ? kernel_init+0x0/0x1ff [0.054999] [<ffffffff81003850>] ? kernel_thread_helper+0x0/0x10 [0.056074] Booting Node 0, Processors #1lockdep: fixing up alternatives. [0.130045] #2lockdep: fixing up alternatives. [0.203089] #3 Ok. [0.275286] Brought up 4 CPUs [0.276005] Total of 4 processors activated (16017.17 BogoMIPS). The cgroup_subsys_state structures referenced by idle tasks are never freed, because the idle tasks should be part of the root cgroup, which is not removable. The problem is that while we do in-fact hold rq->lock, the newly spawned idle thread's cpu is not yet set to the correct cpu so the lockdep check in task_group(): lockdep_is_held(&task_rq(p)->lock) will fail. But this is a chicken and egg problem. Setting the CPU's runqueue requires that the CPU's runqueue already be set. ;-) So insert an RCU read-side critical section to avoid the complaint. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-10-07 10:41:07 -07:00
Dongdong Deng	4ee0a60392	rcu: using ACCESS_ONCE() to observe the jiffies_stall/rnp->qsmask value Using ACCESS_ONCE() to observe the jiffies_stall/rnp->qsmask value due to the caller didn't hold the root_rcu/rnp node's lock. Although use without ACCESS_ONCE() is safe due to the value loaded being used but once, the ACCESS_ONCE() is a good documentation aid -- the variables are being loaded without the services of a lock. Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com> CC: Dipankar Sarma <dipankar@in.ibm.com> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-10-07 10:41:06 -07:00
Paul E. McKenney	b0a0f667a3	sched: suppress RCU lockdep splat in task_fork_fair > =================================================== > [ INFO: suspicious rcu_dereference_check() usage. ] > --------------------------------------------------- > /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection! > > other info that might help us debug this: > > rcu_scheduler_active = 1, debug_locks = 1 > 1 lock held by ifup/23517: > #0: (&rq->lock){-.-.-.}, at: [<c042f782>] task_fork_fair+0x3b/0x108 > > stack backtrace: > Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5 > Call Trace: > [<c075e219>] ? printk+0xf/0x16 > [<c0455842>] lockdep_rcu_dereference+0x74/0x7d > [<c0426854>] task_group+0x6d/0x79 > [<c042686e>] set_task_rq+0xe/0x57 > [<c042f79e>] task_fork_fair+0x57/0x108 > [<c042e965>] sched_fork+0x82/0xf9 > [<c04334b3>] copy_process+0x569/0xe8e > [<c0433ef0>] do_fork+0x118/0x262 > [<c076302f>] ? do_page_fault+0x16a/0x2cf > [<c044b80c>] ? up_read+0x16/0x2a > [<c04085ae>] sys_clone+0x1b/0x20 > [<c04030a5>] ptregs_clone+0x15/0x30 > [<c0402f1c>] ? sysenter_do_call+0x12/0x38 Here a newly created task is having its runqueue assigned. The new task is not yet on the tasklist, so cannot go away. This is therefore a false positive, suppress with an RCU read-side critical section. Reported-by: Ben Greear <greearb@candelatech.com Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Ben Greear <greearb@candelatech.com	2010-10-07 10:40:50 -07:00
Ingo Molnar	556ef63255	Merge commit 'v2.6.36-rc7' into core/rcu Merge reason: Update from -rc3 to -rc7. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-07 09:43:45 +02:00
Ingo Molnar	d4f8f217b8	Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu	2010-10-07 09:43:11 +02:00
Andi Kleen	a337fdac7a	HWPOISON: Copy si_addr_lsb to user The original hwpoison code added a new siginfo field si_addr_lsb to pass the granuality of the fault address to user space. Unfortunately this field was never copied to user space. Fix this here. I added explicit checks for the MCEERR codes to avoid having to patch all potential callers to initialize the field. Signed-off-by: Andi Kleen <ak@linux.intel.com>	2010-10-07 09:41:25 +02:00
Paul E. McKenney	773e3f9357	rcu: move check from rcu_dereference_bh to rcu_read_lock_bh_held As suggested by Linus, push the irqs_disabled() down to the rcu_read_lock_bh_held() level so that all callers get the benefit of the correct check. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-10-05 14:03:02 -07:00
Linus Torvalds	e1d9694cae	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: rcu_read_lock_bh_held(): disabling irqs also disables bh generic-ipi: Fix deadlock in __smp_call_function_single	2010-10-05 13:07:43 -07:00
Linus Torvalds	5336377d62	modules: Fix module_bug_list list corruption race With all the recent module loading cleanups, we've minimized the code that sits under module_mutex, fixing various deadlocks and making it possible to do most of the module loading in parallel. However, that whole conversion totally missed the rather obscure code that adds a new module to the list for BUG() handling. That code was doubly obscure because (a) the code itself lives in lib/bugs.c (for dubious reasons) and (b) it gets called from the architecture-specific "module_finalize()" rather than from generic code. Calling it from arch-specific code makes no sense what-so-ever to begin with, and is now actively wrong since that code isn't protected by the module loading lock any more. So this commit moves the "module_bug_{finalize,cleanup}()" calls away from the arch-specific code, and into the generic code - and in the process protects it with the module_mutex so that the list operations are now safe. Future fixups: - move the module list handling code into kernel/module.c where it belongs. - get rid of 'module_bug_list' and just use the regular list of modules (called 'modules' - imagine that) that we already create and maintain for other reasons. Reported-and-tested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Adrian Bunk <bunk@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-10-05 11:29:27 -07:00
Tejun Heo	cdadf0097c	workqueue: add queue_work and activate_work trace points These two tracepoints allow tracking when and how a work is queued and activated. This patch is based on Frederic's patch to add queue_work trace point. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com>	2010-10-05 10:49:55 +02:00
Tejun Heo	97bd234701	workqueue: prepare for more tracepoints Define workqueue_work event class and use it for workqueue_execute_end trace point. Also, move trace/events/workqueue.h include downwards such that all struct definitions are visible to it. This is to prepare for more tracepoints and doesn't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com>	2010-10-05 10:41:14 +02:00
Jan Blunck	38d018dba3	BKL: Remove BKL from cgroup The BKL is only used in remount_fs and get_sb that are both protected by the superblocks s_umount rw_semaphore. Therefore it is safe to remove the BKL entirely. Signed-off-by: Jan Blunck <jblunck@infradead.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2010-10-04 21:10:42 +02:00
Jan Blunck	db71922217	BKL: Explicitly add BKL around get_sb/fill_super This patch is a preparation necessary to remove the BKL from do_new_mount(). It explicitly adds calls to lock_kernel()/unlock_kernel() around get_sb/fill_super operations for filesystems that still uses the BKL. I've read through all the code formerly covered by the BKL inside do_kern_mount() and have satisfied myself that it doesn't need the BKL any more. do_kern_mount() is already called without the BKL when mounting the rootfs and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called from various places without BKL: simple_pin_fs(), nfs_do_clone_mount() through nfs_follow_mountpoint(), afs_mntpt_do_automount() through afs_mntpt_follow_link(). Both later functions are actually the filesystems follow_link inode operation. vfs_kern_mount() is calling the specified get_sb function and lets the filesystem do its job by calling the given fill_super function. Therefore I think it is safe to push down the BKL from the VFS to the low-level filesystems get_sb/fill_super operation. [arnd: do not add the BKL to those file systems that already don't use it elsewhere] Signed-off-by: Jan Blunck <jblunck@infradead.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Christoph Hellwig <hch@infradead.org>	2010-10-04 21:10:10 +02:00
Thomas Gleixner	bd15141226	genirq: Provide config option to disable deprecated code This option covers now the old chip functions and the irq_desc data fields which are moving to struct irq_data. More stuff will follow. Pretty handy for testing a conversion, whether something broke or not. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 13:40:24 +02:00
Stephane Eranian	540804b5c5	perf_events: Fix invalid pointer when pid is invalid This patch fixes an error in perf_event_open() when the pid provided by the user is invalid. find_lively_task_by_vpid() does not return NULL on error but an error code. Without the fix the error code was silently passed to find_get_context() which would eventually cause a invalid pointer dereference. Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: paulus@samba.org Cc: davem@davemloft.net Cc: fweisbec@gmail.com Cc: perfmon2-devel@lists.sf.net Cc: eranian@gmail.com Cc: robert.richter@amd.com LKML-Reference: <4ca9a5d1.e8e9d80a.3dbb.ffff8f2e@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:47:20 +02:00
Thomas Gleixner	21e2b8c62c	genirq: Provide compat handling for chip->retrigger() Wrap the old chip function retrigger() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121843.025801092@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:50 +02:00
Thomas Gleixner	2f7e99bb9b	genirq: Provide compat handling for chip->set_wake() Wrap the old chip function set_wake() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.927527393@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:48 +02:00
Thomas Gleixner	b2ba2c3003	genirq: Provide compat handling for chip->set_type() Wrap the old chip function set_type() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.832261548@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:47 +02:00
Thomas Gleixner	c96b3b3c44	genirq: Provide compat handling for chip->set_affinity() Wrap the old chip function set_affinity() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.732894108@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:46 +02:00
Thomas Gleixner	37e12df709	genirq: Provide compat handling for chip->startup() Wrap the old chip function startup() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.635152961@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:44 +02:00
Thomas Gleixner	bc310dda41	genirq: Provide compat handling for chip->disable()/shutdown() Wrap the old chip functions disable() and shutdown() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.532070631@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:43 +02:00
Thomas Gleixner	c5f756344c	genirq: Provide compat handling for chip->enable() Wrap the old chip function enable() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.437159182@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:42 +02:00
Thomas Gleixner	0c5c15572a	genirq: Provide compat handling for chip->eoi() Wrap the old chip function eoi() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.339657617@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:41 +02:00
Thomas Gleixner	9205e31d1a	genirq: Provide compat handling for chip->mask_ack() Wrap the old chip function mask_ack() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.240806983@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:40 +02:00
Thomas Gleixner	22a49163e9	genirq: Provide compat handling for chip->ack() Wrap the old chip function ack() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.142624725@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:38 +02:00
Thomas Gleixner	0eda58b7f3	genirq: Provide compat handling for chip->unmask() Wrap the old chip function unmask() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.043608928@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:37 +02:00
Thomas Gleixner	e2c0f8ff0f	genirq: Provide compat handling for chip->mask() Wrap the old chip function mask() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121841.940355859@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:36 +02:00
Thomas Gleixner	3876ec9ef3	genirq: Provide compat handling for bus_lock/bus_sync_unlock Wrap the old chip functions for bus_lock/bus_sync_unlock until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121841.842536121@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:35 +02:00
Thomas Gleixner	a77c463591	genirq: Add new functions to dummy chips The compat functions go away when the core code is converted. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:43:34 +02:00
Thomas Gleixner	6b8ff3120c	genirq: Convert core code to irq_data Convert all references in the core code to orq, chip, handler_data, chip_data, msi_desc, affinity to irq_data.* Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:36:26 +02:00
Thomas Gleixner	ff7dcd44dd	genirq: Create irq_data Low level chip functions need access to irq_desc->handler_data, irq_desc->chip_data and irq_desc->msi_desc. We hand down the irq number to the low level functions, so they need to lookup irq_desc. With sparse irq this means a radix tree lookup. We could hand down irq_desc itself, but low level chip functions have no need to fiddle with it directly and we want to restrict access to irq_desc further. Preparatory patch for new chip functions. Note, that the ugly anon union/struct is there to avoid a full tree wide clean up for now. This is not going to last 3 years like __do_IRQ() Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121841.645542300@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 12:27:16 +02:00
Thomas Gleixner	d9817ebeee	genirq: Provide Kconfig The generic irq Kconfig options are copied around all archs. Provide a generic Kconfig file which can be included. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121843.217333624@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>	2010-10-04 11:01:05 +02:00
Ira W. Snyder	399f1e30ac	kfifo: fix scatterlist usage The kfifo_dma family of functions use sg_mark_end() on the last element in their scatterlist. This forces use of a fresh scatterlist for each DMA operation, which makes recycling a single scatterlist impossible. Change the behavior of the kfifo_dma functions to match the usage of the dma_map_sg function. This means that users must respect the returned nents value. The sample code is updated to reflect the change. This bug is trivial to cause: call kfifo_dma_in_prepare() such that it prepares a scatterlist with a single entry comprising the whole fifo. This is the case when you map the entirety of a newly created empty fifo. This causes the setup_sgl() function to mark the first scatterlist entry as the end of the chain, no matter what comes after it. Afterwards, add and remove some data from the fifo such that another call to kfifo_dma_in_prepare() will create two scatterlist entries. It returns nents=2. However, due to the previous sg_mark_end() call, sg_is_last() will now return true for the first scatterlist element. This causes the sample code to print a single scatterlist element when it should print two. By removing the call to sg_mark_end(), we make the API as similar as possible to the DMA mapping API. All users are required to respect the returned nents. Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu> Cc: Stefani Seibold <stefani@seibold.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-10-01 10:50:58 -07:00
Ingo Molnar	a5a2bad55d	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2010-09-24 09:12:05 +02:00
Thomas Gleixner	d1ea13c6e2	genirq: Cleanup irq_chip->typename leftovers 3 years transition phase is enough. Cleanup the last users and remove the cruft. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Leo Chen <leochen@broadcom.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Chris Zankel <chris@zankel.net>	2010-09-23 19:12:26 +02:00
Paul E. McKenney	269dcc1c2e	rcu: Add tracing data to support queueing models The current tracing data is not sufficient to deduce the average time that a callback spends waiting for a grace period to end. Add three per-CPU counters recording the number of callbacks invoked (ci), the number of callbacks orphaned (co), and the number of callbacks adopted (ca). Given the existing callback queue length (ql), the average wait time in absence of CPU hotplug operations is ql/ci. The units of wait time will be in terms of the duration over which ci was measured. In the presence of CPU hotplug operations, there is room for argument, but ql/(ci-co+ca) won't steer you too far wrong. Also fixes a typo called out by Lucas De Marchi <lucas.de.marchi@gmail.com>. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-09-23 09:16:53 -07:00
Paul E. McKenney	0ddea0ead2	rcu: fix sparse errors in rcutorture.c Add the sparse __rcu address-space identifier and make a couple of variables static. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-09-23 09:16:42 -07:00
Christian Dietrich	829f8ed2c9	kernel: Remove undead ifdef CONFIG_DEBUG_LOCK_ALLOC The CONFIG_DEBUG_LOCK_ALLOC ifdef isn't necessary at this point, because it is checked in an outer ifdef level already and has no effect here. Signed-off-by: Christian Dietrich <qy03fugy@stud.informatik.uni-erlangen.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-09-23 09:14:51 -07:00
Joe Perches	99a51792db	kernel/pm_qos_params.c: Remove unnecessary casts of private_data Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-09-23 13:33:46 +02:00
Stephen Rothwell	1ef21199a5	powerpc: define a compat_sys_recv cond_syscall Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-09-23 17:03:55 +10:00
Andrea Arcangeli	a247c3a97a	rmap: fix walk during fork The below bug in fork led to the rmap walk finding the parent huge-pmd twice instead of just once, because the anon_vma_chain objects of the child vma still point to the vma->vm_mm of the parent. The patch fixes it by making the rmap walk accurate during fork. It's not a big deal normally but it worth being accurate considering the cost is the same. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-09-22 17:22:39 -07:00
Jason Baron	8f7b50c514	jump label: Tracepoint support for jump labels Make use of the jump label infrastructure for tracepoints. Signed-off-by: Jason Baron <jbaron@redhat.com> LKML-Reference: <a9ba2056e2c9cf332c3c300b577463ce66ff23a8.1284733808.git.jbaron@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-22 16:31:01 -04:00
Jason Baron	4c3ef6d793	jump label: Add jump_label_text_reserved() to reserve jump points Add a jump_label_text_reserved(void start, void end), so that other pieces of code that want to modify kernel text, can first verify that jump label has not reserved the instruction. Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Jason Baron <jbaron@redhat.com> LKML-Reference: <06236663a3a7b1c1f13576bb9eccb6d9c17b7bfe.1284733808.git.jbaron@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-22 16:30:46 -04:00
Jason Baron	e0cf0cd496	jump label: Initialize workqueue tracepoints before they are registered Initialize the workqueue data structures before they are registered so that they are ready for callbacks. Signed-off-by: Jason Baron <jbaron@redhat.com> LKML-Reference: <e3a3383fc370ac7086625bebe89d9480d7caf372.1284733808.git.jbaron@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-22 16:30:03 -04:00
Jason Baron	bf5438fca2	jump label: Base patch for jump label base patch to implement 'jump labeling'. Based on a new 'asm goto' inline assembly gcc mechanism, we can now branch to labels from an 'asm goto' statment. This allows us to create a 'no-op' fastpath, which can subsequently be patched with a jump to the slowpath code. This is useful for code which might be rarely used, but which we'd like to be able to call, if needed. Tracepoints are the current usecase that these are being implemented for. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jason Baron <jbaron@redhat.com> LKML-Reference: <ee8b3595967989fdaf84e698dc7447d315ce972a.1284733808.git.jbaron@redhat.com> [ cleaned up some formating ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-22 16:29:41 -04:00
Ingo Molnar	90edf27fb8	Merge branch 'linus' into perf/core Conflicts: kernel/hw_breakpoint.c Merge reason: resolve the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-22 18:45:01 +02:00
Thomas Gleixner	676cb02dc3	softirqs: Make wakeup_softirqd static No users outside of kernel/softirq.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-09-22 10:15:42 +02:00
Linus Torvalds	1ce1e41c1b	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix nohz balance kick sched: Fix user time incorrectly accounted as system time on 32-bit	2010-09-21 13:22:10 -07:00
Linus Torvalds	87ac6fa26e	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hw breakpoints: Fix pid namespace bug x86: Fix instruction breakpoint encoding oprofile: Add Support for Intel CPU Family 6 / Model 22 (Intel Celeron 540) kprobes: Fix Kconfig dependency	2010-09-21 13:21:42 -07:00
Steven Rostedt	a8027073eb	tracing/sched: Add sched_pi_setprio tracepoint Add a tracepoint that shows the priority of a task being boosted via priority inheritance. Cc: Gregory Haskins <ghaskins@novell.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-21 10:56:41 -04:00
Steven Rostedt	b3bc211cfe	sched: Give CPU bound RT tasks preference If a high priority task is waking up on a CPU that is running a lower priority task that is bound to a CPU, see if we can move the high RT task to another CPU first. Note, if all other CPUs are running higher priority tasks than the CPU bounded current task, then it will be preempted regardless. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> LKML-Reference: <20100921024138.888922071@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-21 13:57:12 +02:00
Steven Rostedt	43fa5460fe	sched: Try not to migrate higher priority RT tasks When first working on the RT scheduler design, we concentrated on keeping all CPUs running RT tasks instead of having multiple RT tasks on a single CPU waiting for the migration thread to move them. Instead we take a more proactive stance and push or pull RT tasks from one CPU to another on wakeup or scheduling. When an RT task wakes up on a CPU that is running another RT task, instead of preempting it and killing the cache of the running RT task, we look to see if we can migrate the RT task that is waking up, even if the RT task waking up is of higher priority. This may sound a bit odd, but RT tasks should be limited in migration by the user anyway. But in practice, people do not do this, which causes high prio RT tasks to bounce around the CPUs. This becomes even worse when we have priority inheritance, because a high prio task can block on a lower prio task and boost its priority. When the lower prio task wakes up the high prio task, if it happens to be on the same CPU it will migrate off of it. But in reality, the above does not happen much either, because the wake up of the lower prio task, which has already been boosted, if it was on the same CPU as the higher prio task, it would then migrate off of it. But anyway, we do not want to migrate them either. To examine the scheduling, I created a test program and examined it under kernelshark. The test program created CPU * 2 threads, where each thread had a different priority. The program takes different options. The options used in this change log was to have priority inheritance mutexes or not. All threads did the following loop: static void grab_lock(long id, int iter, int l) { ftrace_write("thread %ld iter %d, taking lock %d\n", id, iter, l); pthread_mutex_lock(&locks[l]); ftrace_write("thread %ld iter %d, took lock %d\n", id, iter, l); busy_loop(nr_tasks - id); ftrace_write("thread %ld iter %d, unlock lock %d\n", id, iter, l); pthread_mutex_unlock(&locks[l]); } void start_task(void id) { [...] while (!done) { for (l = 0; l < nr_locks; l++) { grab_lock(id, i, l); ftrace_write("thread %ld iter %d sleeping\n", id, i); ms_sleep(id); } i++; } [...] } The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes to the ftrace buffer to help analyze via ftrace. The higher the id, the higher the prio, the shorter it does the busy loop, but the longer it spins. This is usually the case with RT tasks, the lower priority tasks usually run longer than higher priority tasks. At the end of the test, it records the number of loops each thread took, as well as the number of voluntary preemptions, non-voluntary preemptions, and number of migrations each thread took, taking the information from /proc/$$/sched and /proc/$$/status. Running this on a 4 CPU processor, the results without changes to the kernel looked like this: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 53 3220 1470 98 1: 562 773 724 98 2: 752 933 1375 98 3: 749 39 697 98 4: 758 5 515 98 5: 764 2 679 99 6: 761 2 535 99 7: 757 3 346 99 total: 5156 4977 6341 787 Each thread regardless of priority migrated a few hundred times. The higher priority tasks, were a little better but still took quite an impact. By letting higher priority tasks bump the lower prio task from the CPU, things changed a bit: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 37 2835 1937 98 1: 666 1821 1865 98 2: 654 1003 1385 98 3: 664 635 973 99 4: 698 197 352 99 5: 703 101 159 99 6: 708 1 75 99 7: 713 1 2 99 total: 4843 6594 6748 789 The total # of migrations did not change (several runs showed the difference all within the noise). But we now see a dramatic improvement to the higher priority tasks. (kernelshark showed that the watchdog timer bumped the highest priority task to give it the 2 count. This was actually consistent with every run). Notice that the # of iterations did not change either. The above was with priority inheritance mutexes. That is, when the higher prority task blocked on a lower priority task, the lower priority task would inherit the higher priority task (which shows why task 6 was bumped so many times). When not using priority inheritance mutexes, the current kernel shows this: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 56 3101 1892 95 1: 594 713 937 95 2: 625 188 618 95 3: 628 4 491 96 4: 640 7 468 96 5: 631 2 501 96 6: 641 1 466 96 7: 643 2 497 96 total: 4458 4018 5870 765 Not much changed with or without priority inheritance mutexes. But if we let the high priority task bump lower priority tasks on wakeup we see: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 115 3439 2782 98 1: 633 1354 1583 99 2: 652 919 1218 99 3: 645 713 934 99 4: 690 3 3 99 5: 694 1 4 99 6: 720 3 4 99 7: 747 0 1 100 Which shows a even bigger change. The big difference between task 3 and task 4 is because we have only 4 CPUs on the machine, causing the 4 highest prio tasks to always have preference. Although I did not measure cache misses, and I'm sure there would be little to measure since the test was not data intensive, I could imagine large improvements for higher priority tasks when dealing with lower priority tasks. Thus, I'm satisfied with making the change and agreeing with what Gregory Haskins argued a few years ago when we first had this discussion. One final note. All tasks in the above tests were RT tasks. Any RT task will always preempt a non RT task that is running on the CPU the RT task wants to run on. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> LKML-Reference: <20100921024138.605460343@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-21 13:57:12 +02:00
Venkatesh Pallipadi	58b26c4c02	sched: Increment cache_nice_tries only on periodic lb scheduler uses cache_nice_tries as an indicator to do cache_hot and active load balance, when normal load balance fails. Currently, this value is changed on any failed load balance attempt. That ends up being not so nice to workloads that enter/exit idle often, as they do more frequent new_idle balance and that pretty soon results in cache hot tasks being pulled in. Making the cache_nice_tries ignore failed new_idle balance seems to make better sense. With that only the failed load balance in periodic load balance gets accounted and the rate of accumulation of cache_nice_tries will not depend on idle entry/exit (short running sleep-wakeup kind of tasks). This reduces movement of cache_hot tasks. schedstat diff (after-before) excerpt from a workload that has frequent and short wakeup-idle pattern (:2 in cpu col below refers to NEWIDLE idx) This snapshot was across ~400 seconds. Without this change: domainstats: domain0 cpu cnt bln fld imb gain hgain nobusyq nobusyg 0:2 306487 219575 73167 110069413 44583 19070 1172 218403 1:2 292139 194853 81421 120893383 50745 21902 1259 193594 2:2 283166 174607 91359 129699642 54931 23688 1287 173320 3:2 273998 161788 93991 132757146 57122 24351 1366 160422 4:2 289851 215692 62190 83398383 36377 13680 851 214841 5:2 316312 222146 77605 117582154 49948 20281 988 221158 6:2 297172 195596 83623 122133390 52801 21301 929 194667 7:2 283391 178078 86378 126622761 55122 22239 928 177150 8:2 297655 210359 72995 110246694 45798 19777 1125 209234 9:2 297357 202011 79363 119753474 50953 22088 1089 200922 10:2 278797 178703 83180 122514385 52969 22726 1128 177575 11:2 272661 167669 86978 127342327 55857 24342 1195 166474 12:2 293039 204031 73211 110282059 47285 19651 948 203083 13:2 289502 196762 76803 114712942 49339 20547 1016 195746 14:2 264446 169609 78292 115715605 50459 21017 982 168627 15:2 260968 163660 80142 116811793 51483 21281 1064 162596 With this change: domainstats: domain0 cpu cnt bln fld imb gain hgain nobusyq nobusyg 0:2 272347 187380 77455 105420270 24975 1 953 186427 1:2 267276 172360 86234 116242264 28087 6 1028 171332 2:2 259769 156777 93281 123243134 30555 1 1043 155734 3:2 250870 143129 97627 127370868 32026 6 1188 141941 4:2 248422 177116 64096 78261112 22202 2 757 176359 5:2 275595 180683 84950 116075022 29400 6 778 179905 6:2 262418 162609 88944 119256898 31056 4 817 161792 7:2 252204 147946 92646 122388300 32879 4 824 147122 8:2 262335 172239 81631 110477214 26599 4 864 171375 9:2 261563 164775 88016 117203621 28331 3 849 163926 10:2 243389 140949 93379 121353071 29585 2 909 140040 11:2 242795 134651 98310 124768957 30895 2 1016 133635 12:2 255234 166622 79843 104696912 26483 4 746 165876 13:2 244944 151595 83855 109808099 27787 3 801 150794 14:2 241301 140982 89935 116954383 30403 6 845 140137 15:2 232271 128564 92821 119185207 31207 4 1416 127148 Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1284167957-3675-1-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-21 13:57:11 +02:00
Ingo Molnar	cf84fd9632	Merge commit 'v2.6.36-rc5' into sched/core Merge reason: Pick up the latest fixes in -rc5. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-21 13:56:49 +02:00
Peter Zijlstra	41945f6ccf	perf: Avoid RCU vs preemption assumptions The per-pmu per-cpu context patch converted things from get_cpu_var() to this_cpu_ptr(), but that only works if rcu_read_lock() actually disables preemption, and since there is no such guarantee, we need to fix that. Use the newly introduced {get,put}_cpu_ptr(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> LKML-Reference: <20100917093009.308453028@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-21 13:55:44 +02:00
Ingo Molnar	7ed569206e	Merge commit 'v2.6.36-rc5' into perf/core Merge reason: Pick up the latest fixes in -rc5. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-21 13:55:11 +02:00
Suresh Siddha	f6c3f1686e	sched: Fix nohz balance kick There's a situation where the nohz balancer will try to wake itself: cpu-x is idle which is also ilb_cpu got a scheduler tick during idle and the nohz_kick_needed() in trigger_load_balance() checks for rq_x->nr_running which might not be zero (because of someone waking a task on this rq etc) and this leads to the situation of the cpu-x sending a kick to itself. And this can cause a lockup. Avoid this by not marking ourself eligible for kicking. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1284400941.2684.19.camel@sbsiddha-MOBL3.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-21 13:50:50 +02:00
Tejun Heo	09383498c5	workqueue: implement flush[_delayed]_work_sync() Implement flush[_delayed]_work_sync(). These are flush functions which also make sure no CPU is still executing the target work from earlier queueing instances. These are similar to cancel[_delayed]_work_sync() except that the target work item is flushed instead of cancelled. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-09-19 17:51:05 +02:00
Tejun Heo	baf59022c3	workqueue: factor out start_flush_work() Factor out start_flush_work() from flush_work(). start_flush_work() has @wait_executing argument which controls whether the barrier is queued only if the work is pending or also if executing. As flush_work() needs to wait for execution too, it uses %true. This commit doesn't cause any behavior difference. start_flush_work() will be used to implement flush_work_sync(). Signed-off-by: Tejun Heo <tj@kernel.org>	2010-09-19 17:51:05 +02:00
Tejun Heo	401a8d048e	workqueue: cleanup flush/cancel functions Make the following cleanup changes. * Relocate flush/cancel function prototypes and definitions. * Relocate wait_on_cpu_work() and wait_on_work() before try_to_grab_pending(). These will be used to implement flush_work_sync(). * Make all flush/cancel functions return bool instead of int. * Update wait_on_cpu_work() and wait_on_work() to return %true if they actually waited. * Add / update comments. This patch doesn't cause any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-09-19 17:51:05 +02:00
Namhyung Kim	15e408cd6c	futex: Add lock context annotations queue_lock/unlock/me() and unqueue_me_pi() grab/release spinlocks but are missing proper annotations. Add them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <dvhltc@us.ibm.com> LKML-Reference: <1284468228-8723-3-git-send-email-namhyung@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-09-18 12:19:21 +02:00
Namhyung Kim	a3c74c5257	futex: Mark restart_block.futex.uaddr[2] __user @uaddr and @uaddr2 fields in restart_block.futex are user pointers. Add __user and remove unnecessary casts. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <dvhltc@us.ibm.com> LKML-Reference: <1284468228-8723-2-git-send-email-namhyung@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-09-18 12:19:21 +02:00
Namhyung Kim	1dcc41bb03	futex: Change 3rd arg of fetch_robust_entry() to unsigned int* Sparse complains: kernel/futex.c:2495:59: warning: incorrect type in argument 3 (different signedness) Make 3rd argument of fetch_robust_entry() 'unsigned int'. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <dvhltc@us.ibm.com> LKML-Reference: <1284468228-8723-1-git-send-email-namhyung@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-09-18 12:19:21 +02:00
Peter Zijlstra	e9d2b06414	perf: Undo the per cpu-context timer stuff Revert the timer per cpu-context timers because of unfortunate nohz interaction. Fixing that would have been somewhat ugly, so go back to driving things from the regular tick. Provide a jiffies interval feature for people who want slower rotations. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <20100917093009.519845633@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-17 12:48:48 +02:00
Peter Zijlstra	917bdd1c9b	perf: Fix perf_event_exit_cpu_context() Use the right cpu-context.. spotted by preempt warning on hot-unplug Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Cc: Robert Richter <robert.richter@amd.com> LKML-Reference: <20100917093009.461794357@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-17 12:48:48 +02:00
Peter Zijlstra	b04243ef70	perf: Complete software pmu grouping Aside from allowing software events into a !software group, allow adding !software events to pure software groups. Once we've moved the software group and attached the first !software event, the group will no longer be a pure software group and hence no longer be eligible for movement, at which point the straight ctx comparison is correct again. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20100917093009.410784731@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-17 12:48:48 +02:00
Stephane Eranian	d14b12d7ad	perf_events: Fix broken event grouping Events were not grouped anymore. The reason was that in perf_event_open(), the field event->group_leader was initialized before the function looked up the group_fd to find the event leader. This patch fixes this by reordering the code correctly. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> LKML-Reference: <20100917093009.360420946@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-17 12:48:47 +02:00
Matt Helsley	068e35eee9	hw breakpoints: Fix pid namespace bug Hardware breakpoints can't be registered within pid namespaces because tsk->pid is passed rather than the pid in the current namespace. (See https://bugzilla.kernel.org/show_bug.cgi?id=17281 ) This is a quick fix demonstrating the problem but is not the best method of solving the problem since passing pids internally is not the best way to avoid pid namespace bugs. Subsequent patches will show a better solution. Much thanks to Frederic Weisbecker <fweisbec@gmail.com> for doing the bulk of the work finding this bug. Reported-by: Robin Green <greenrd@greenrd.org> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: 2.6.33-2.6.35 <stable@kernel.org> LKML-Reference: <f63454af09fb1915717251570423eb9ddd338340.1284407762.git.matthltc@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-09-17 04:42:59 +02:00
Linus Torvalds	94ca9d669a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: add documentation	2010-09-16 12:50:31 -07:00
Heiko Carstens	31915ab4cb	sched: Remove branch hints within context_switch() With `710390d9` "sched: Optimize branch hint in context_switch()" the branch hint logic within context_switch() got inversed. In fact the hints "if (likely(!mm))" and "if (likely(!prev->mm))" mean that it is likely that the previous and next task are kernel threads. That assumption is certainly counter intuitive, but Tim has shown that at least with his workload this is true. Nevertheless the truth is: it depends on the current workload. So just remove the annotations which also improves readability. Reported-by: Tim Blechmann <tim@klingt.org> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20100916124225.GA2209@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-16 16:38:34 +02:00
Namhyung Kim	635c17c2b2	kprobes: Add sparse context annotations This removes following warnings when build with C=1 warning: context imbalance in 'kretprobe_hash_lock' - wrong count at exit warning: context imbalance in 'kretprobe_table_lock' - wrong count at exit warning: context imbalance in 'kretprobe_hash_unlock' - unexpected unlock warning: context imbalance in 'kretprobe_table_unlock' - unexpected unlock Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> LKML-Reference: <1284512670-2369-6-git-send-email-namhyung@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-15 10:44:02 +02:00
Namhyung Kim	6376b22975	kprobes: Make functions static Make following (internal) functions static to make sparse happier :-) * get_optimized_kprobe: only called from static functions * kretprobe_table_unlock: _lock function is static * kprobes_optinsn_template_holder: never called but holding asm code Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> LKML-Reference: <1284512670-2369-4-git-send-email-namhyung@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-15 10:44:01 +02:00
Namhyung Kim	05662bdb64	kprobes: Verify jprobe entry point Verify jprobe's entry point is a function entry point using kallsyms' offset value. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> LKML-Reference: <1284512670-2369-3-git-send-email-namhyung@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-15 10:44:01 +02:00
Namhyung Kim	edbaadbe42	kprobes: Remove redundant address check Remove call to kernel_text_address() in register_jprobes() because it is called right after in register_kprobe(). Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> LKML-Reference: <1284512670-2369-2-git-send-email-namhyung@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-15 10:44:00 +02:00
Matt Helsley	38a81da220	perf events: Clean up pid passing The kernel perf event creation path shouldn't use find_task_by_vpid() because a vpid exists in a specific namespace. find_task_by_vpid() uses current's pid namespace which isn't always the correct namespace to use for the vpid in all the places perf_event_create_kernel_counter() (and thus find_get_context()) is called. The goal is to clean up pid namespace handling and prevent bugs like: https://bugzilla.kernel.org/show_bug.cgi?id=17281 Instead of using pids switch find_get_context() to use task struct pointers directly. The syscall is responsible for resolving the pid to a task struct. This moves the pid namespace resolution into the syscall much like every other syscall that takes pid parameters. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robin Green <greenrd@greenrd.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> LKML-Reference: <a134e5e392ab0204961fd1a62c84a222bf5874a9.1284407763.git.matthltc@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-15 10:44:00 +02:00
Matt Helsley	2ebd4ffb6d	perf events: Split out task search into helper Split out the code which searches for non-exiting tasks into its own helper. Creating this helper not only makes the code slightly more readable it prepares to move the search out of find_get_context() in a subsequent commit. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robin Green <greenrd@greenrd.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> LKML-Reference: <561205417b450b8a4bf7488374541d64b4690431.1284407762.git.matthltc@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-15 10:44:00 +02:00
Matt Helsley	d958077d00	hw breakpoints: Fix pid namespace bug Hardware breakpoints can't be registered within pid namespaces because tsk->pid is passed rather than the pid in the current namespace. (See https://bugzilla.kernel.org/show_bug.cgi?id=17281 ) This is a quick fix demonstrating the problem but is not the best method of solving the problem since passing pids internally is not the best way to avoid pid namespace bugs. Subsequent patches will show a better solution. Much thanks to Frederic Weisbecker <fweisbec@gmail.com> for doing the bulk of the work finding this bug. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robin Green <greenrd@greenrd.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> LKML-Reference: <f63454af09fb1915717251570423eb9ddd338340.1284407762.git.matthltc@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-15 10:43:59 +02:00
Stephane Eranian	d9ca07a05c	watchdog: Avoid kernel crash when disabling watchdog In case you boot with the watchdog disabled, i.e., nowatchdog, then, if you try to disable it via /proc/sys/kernel/watchdog, you get a kernel crash. The reason is that you are trying to cancel a hrtimer which has never been initialized. This patch fixes this by skipping execution of watchdog_disable_all_cpus() when the watchdog is marked disabled from boot. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4c8f7a23.cae9d80a.2c11.0bb4@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-15 10:43:58 +02:00
Stanislaw Gruszka	e75e863dd5	sched: Fix user time incorrectly accounted as system time on 32-bit We have 32-bit variable overflow possibility when multiply in task_times() and thread_group_times() functions. When the overflow happens then the scaled utime value becomes erroneously small and the scaled stime becomes i erroneously big. Reported here: https://bugzilla.redhat.com/show_bug.cgi?id=633037 https://bugzilla.kernel.org/show_bug.cgi?id=16559 Reported-by: Michael Chapman <redhat-bugzilla@very.puzzling.org> Reported-by: Ciriaco Garcia de Celis <sysman@etherpilot.com> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: <stable@kernel.org> # 2.6.32.19+ (partially) and 2.6.33+ LKML-Reference: <20100914143513.GB8415@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-15 10:41:36 +02:00
Ingo Molnar	3aabae7d9d	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2010-09-15 10:27:31 +02:00
Steven Rostedt	79e406d7b0	tracing: Remove leftover FTRACE_ENABLE/DISABLE_MCOUNT enums The enums for FTRACE_ENABLE_MCOUNT and FTRACE_DISABLE_MCOUNT were used as commands to ftrace_run_update_code(). But these commands were used by the old nasty ftrace daemon that has long been slain. This is a clean up patch to remove the references to these enums and simplify the code a little. Reported-by: Wu Zhangjin <wuzhangjin@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-14 22:19:46 -04:00
Steven Rostedt	b304d0441a	tracing: Do not trace in irq when funcgraph-irq option is zero When the function graph tracer funcgraph-irq option is zero, disable tracing in IRQs. This makes the option have two effects. 1) When reading the trace file, do not display the functions that happen in interrupt context (when detected) 2) [new] When recording a trace, skip those that are detected to be in interrupt by the 'in_irq()' function Note, in_irq() is updated at irq_enter() and irq_exit(). There are still functions that are recorded by the function graph tracer that is in interrupt context but outside the irq_enter/exit() routines. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-14 20:18:07 -04:00
Jiri Olsa	2bd16212b8	tracing: Add funcgraph-irq option for function graph tracer. It's handy to be able to disable the irq related output and not to have to jump over each irq related code, when you have no interrest in it. The option is by default enabled, so there's no change to current behaviour. It affects only the final output, so all the irq related data stay in the ring buffer. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <20100907145344.GC1912@jolsa.brq.redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-14 20:18:07 -04:00
H. Peter Anvin	c41d68a513	compat: Make compat_alloc_user_space() incorporate the access_ok() compat_alloc_user_space() expects the caller to independently call access_ok() to verify the returned area. A missing call could introduce problems on some architectures. This patch incorporates the access_ok() check into compat_alloc_user_space() and also adds a sanity check on the length. The existing compat_alloc_user_space() implementations are renamed arch_compat_alloc_user_space() and are used as part of the implementation of the new global function. This patch assumes NULL will cause __get_user()/__put_user() to either fail or access userspace on all architectures. This should be followed by checking the return value of compat_access_user_space() for NULL in the callers, at which time the access_ok() in the callers can also be removed. Reported-by: Ben Hawkes <hawkes@sota.gen.nz> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: James Bottomley <jejb@parisc-linux.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: <stable@kernel.org>	2010-09-14 16:08:45 -07:00
Steven Rostedt	57c072c711	tracing: Fix reading of set_ftrace_filter across lists If we do: # cd /sys/kernel/debug # echo 'do_IRQ:traceon schedule:traceon sys_write:traceon' > \ set_ftrace_filter # cat set_ftrace_filter We get the following output: #### all functions enabled #### sys_write:traceon:unlimited schedule:traceon:unlimited do_IRQ:traceon:unlimited This outputs two lists. One is the fact that all functions are currently enabled for function tracing, the other has three probed functions, which happen to have 'traceon' as their commands. Currently, when reading the first list (functions enabled) the seq_file code will receive a "NULL" from the t_next() function causing it to exit early. This makes "read()" from userspace stop reading the code at this boarder. Although read is allowed to do this, some (broken) applications might consider this an end of file and stop early. This patch adds the start of the second list to t_next() when it finishes the first list. It is a simple change and gives the set_ftrace_filter file nicer reading ability. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-14 15:14:20 -04:00
Steven Rostedt	98c4fd046f	tracing: Keep track of set_ftrace_filter position and allow lseek again This patch keeps track of the index within the elements of set_ftrace_filter and if the position goes backwards, it nicely resets and starts from the beginning again. This allows for lseek and pread to work properly now. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-14 14:46:01 -04:00
Steven Rostedt	4aeb69672d	tracing: Replace typecasted void pointer in set_ftrace_filter code The set_ftrace_filter uses seq_file and reads from two lists. The pointer returned by t_next() can either be of type struct dyn_ftrace or struct ftrace_func_probe. If there is a bug (there was one) the wrong pointer may be used and the reference can cause an oops. This patch makes t_next() and friends only return the iterator structure which now has a pointer of type struct dyn_ftrace and struct ftrace_func_probe. The t_show() can now test if the pointer is NULL or not and if the pointer exists, it is guaranteed to be of the correct type. Now if there's a bug, only wrong data will be shown but not an oops. Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-14 11:42:30 -04:00
Steven Rostedt	2bccfffd15	tracing: Do not reset pos in set_ftrace_filter After the filtered functions are read, the probed functions are read from the hash in set_ftrace_filter. When the hashed probed functions are read, the pos passed in is reset. Instead of modifying the pos given to the read function, just record the pos where the filtered functions ended and subtract from that. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-14 11:42:29 -04:00
Mathieu Desnoyers	7740191cd9	sched: Fix string comparison in /proc/sched_features Fix incorrect handling of the following case: INTERACTIVE INTERACTIVE_SOMETHING_ELSE The comparison only checks up to each element's length. Changelog since v1: - Embellish using some Rostedtisms. [ mingo: ^^ == smaller and cleaner ] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tony Lindgren <tony@atomide.com> LKML-Reference: <20100913214700.GB16118@Krystal> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-14 13:23:45 +02:00
Ingo Molnar	0bf377bbb0	sched: Improve latencies under load by decreasing minimum scheduling granularity Mathieu reported bad latencies with make -j10 kind of kbuild workloads - which is mostly caused by us scheduling with a too coarse granularity. Reduce the minimum granularity some more, to make sure we can meet the latency target. I got the following results (make -j10 kbuild load, average of 3 runs): vanilla: maximum latency: 38278.9 µs average latency: 7730.1 µs patched: maximum latency: 22702.1 µs average latency: 6684.8 µs Mathieu also measured it: \| \| * wakeup-latency.c (SIGEV_THREAD) with make -j10 \| \| - Mainline 2.6.35.2 kernel \| \| maximum latency: 45762.1 µs \| average latency: 7348.6 µs \| \| - With only Peter's smaller min_gran (shown below): \| \| maximum latency: 29100.6 µs \| average latency: 6684.1 µs \| Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <AANLkTi=8m4g01wZPacySoF7U0PevTNVgJoZZrHiUD-pN@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-13 20:17:11 +02:00
Peter Zijlstra	0c67b40872	perf: Fix free_event() With the context rework stuff we can actually end up freeing an event before it gets attached to a context. Reported-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-13 17:08:42 +02:00
Peter Zijlstra	cde8e88498	perf: Sanitize the RCU logic Simplify things and simply synchronize against two RCU variants for PMU unregister -- we don't care about performance, its module unload if anything. Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-13 17:08:42 +02:00
Tejun Heo	c54fce6eff	workqueue: add documentation Update copyright notice and add Documentation/workqueue.txt. Randy Dunlap, Dave Chinner: misc fixes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-By: Florian Mickler <florian@mickler.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Dave Chinner <david@fromorbit.com>	2010-09-13 10:26:52 +02:00
Linus Torvalds	84e1d836ef	Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM / Hibernate: Avoid hitting OOM during preallocation of memory PM QoS: Correct pr_debug() misuse and improve parameter checks PM: Prevent waiting forever on asynchronous resume after failing suspend	2010-09-11 15:50:53 -07:00
Rafael J. Wysocki	6715045ddc	PM / Hibernate: Avoid hitting OOM during preallocation of memory There is a problem in hibernate_preallocate_memory() that it calls preallocate_image_memory() with an argument that may be greater than the total number of available non-highmem memory pages. If that's the case, the OOM condition is guaranteed to trigger, which in turn can cause significant slowdown to occur during hibernation. To avoid that, make preallocate_image_memory() adjust its argument before calling preallocate_image_pages(), so that the total number of saveable non-highem pages left is not less than the minimum size of a hibernation image. Change hibernate_preallocate_memory() to try to allocate from highmem if the number of pages allocated by preallocate_image_memory() is too low. Modify free_unnecessary_pages() to take all possible memory allocation patterns into account. Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Tested-by: M. Vefa Bicakci <bicave@superonline.com>	2010-09-11 21:03:53 +02:00
Linus Torvalds	aad1830e6b	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, tsc: Fix a preemption leak in restore_sched_clock_state() sched: Move sched_avg_update() to update_cpu_load()	2010-09-11 07:59:49 -07:00
mark gross	0109c2c48d	PM QoS: Correct pr_debug() misuse and improve parameter checks Correct some pr_debug() misuse and add a stronger parameter check to pm_qos_write() for the ASCII hex value case. Thanks to Dan Carpenter for pointing out the problem! Signed-off-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-09-11 00:53:05 +02:00
Peter Zijlstra	e5f4d3394a	perf: Fix perf_init_event() We ought to return -ENOENT when non of the registered PMUs recognise the requested event. This fixes a boot crash that occurs if no PMU is available but the NMI watchdog tries to register an event. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-10 17:41:55 +02:00
Heiko Carstens	27c379f7f8	generic-ipi: Fix deadlock in __smp_call_function_single Just got my 6 way machine to a state where cpu 0 is in an endless loop within __smp_call_function_single. All other cpus are idle. The call trace on cpu 0 looks like this: __smp_call_function_single scheduler_tick update_process_times tick_sched_timer __run_hrtimer hrtimer_interrupt clock_comparator_work do_extint ext_int_handler ----> timer irq cpu_idle __smp_call_function_single() got called from nohz_balancer_kick() (inlined) with the remote cpu being 1, wait being 0 and the per cpu variable remote_sched_softirq_cb (call_single_data) of the current cpu (0). Then it loops forever when it tries to grab the lock of the call_single_data, since it is already locked and enqueued on cpu 0. My theory how this could have happened: for some reason the scheduler decided to call __smp_call_function_single() on it's own cpu, and sends an IPI to itself. The interrupt stays pending since IRQs are disabled. If then the hypervisor schedules the cpu away it might happen that upon rescheduling both the IPI and the timer IRQ are pending. If then interrupts are enabled again it depends which one gets scheduled first. If the timer interrupt gets delivered first we end up with the local deadlock as seen in the calltrace above. Let's make __smp_call_function_single() check if the target cpu is the current cpu and execute the function immediately just like smp_call_function_single does. That should prevent at least the scenario described here. It might also be that the scheduler is not supposed to call __smp_call_function_single with the remote cpu being the current cpu, but that is a different issue. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Jens Axboe <jaxboe@fusionio.com> Cc: Venkatesh Pallipadi <venki@google.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <20100910114729.GB2827@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-10 16:48:40 +02:00
Linus Torvalds	f2955b490b	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: t_start: reset FTRACE_ITER_HASH in case of seek/pread perf symbols: Fix multiple initialization of symbol system perf: Fix CPU hotplug perf, trace: Fix module leak tracing/kprobe: Fix handling of C-unlike argument names tracing/kprobes: Fix handling of argument names perf probe: Fix handling of arguments names perf probe: Fix return probe support tracing/kprobe: Fix a memory leak in error case tracing: Do not allow llseek to set_ftrace_filter	2010-09-10 07:31:24 -07:00
Peter Zijlstra	cee010ec52	perf: Ensure we call add_event_to_ctx() with the right locks held Even though we call it from the inherit path, where the child is not yet accessible, we need to hold ctx->lock, add_event_to_ctx() assumes IRQs are disabled. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-10 16:24:33 +02:00
Chris Wright	df09162550	tracing: t_start: reset FTRACE_ITER_HASH in case of seek/pread Be sure to avoid entering t_show() with FTRACE_ITER_HASH set without having properly started the iterator to iterate the hash. This case is degenerate and, as discovered by Robert Swiecki, can cause t_hash_show() to misuse a pointer. This causes a NULL ptr deref with possible security implications. Tracked as CVE-2010-3079. Cc: Robert Swiecki <swiecki@google.com> Cc: Eugene Teo <eugene@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-09 22:43:49 -04:00
Hugh Dickins	910321ea81	swap: revert special hibernation allocation Please revert 2.6.36-rc commit `d2997b1042` "hibernation: freeze swap at hibernation". It complicated matters by adding a second swap allocation path, just for hibernation; without in any way fixing the issue that it was intended to address - page reclaim after fixing the hibernation image might free swap from a page already imaged as swapcache, letting its swap be reallocated to store a different page of the image: resulting in data corruption if the imaged page were freed as clean then swapped back in. Pages freed to si->swap_map were still in danger of being reallocated by the alternative allocation path. I guess it inadvertently fixed slow SSD swap allocation for hibernation, as reported by Nigel Cunningham: by missing out the discards that occur on the usual swap allocation path; but that was unintentional, and needs a separate fix. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Ondrej Zary <linux@rainbow-software.org> Cc: Andrea Gelmini <andrea.gelmini@gmail.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nigel Cunningham <nigel@tuxonice.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-09-09 18:57:25 -07:00
Jerome Marchand	1c24de60e5	kernel/groups.c: fix integer overflow in groups_search gid_t is a unsigned int. If group_info contains a gid greater than MAX_INT, groups_search() function may look on the wrong side of the search tree. This solves some unfair "permission denied" problems. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-09-09 18:57:24 -07:00
Michael S. Tsirkin	31583bb0cf	cgroups: fix API thinko Add cgroup_attach_task_all() The existing cgroup_attach_task_current_cg() API is called by a thread to attach another thread to all of its cgroups; this is unsuitable for cases where a privileged task wants to attach itself to the cgroups of a less privileged one, since the call must be made from the context of the target task. This patch adds a more generic cgroup_attach_task_all() API that allows both the source task and to-be-moved task to be specified. cgroup_attach_task_current_cg() becomes a specialization of the more generic new function. [menage@google.com: rewrote changelog] [akpm@linux-foundation.org: address reviewer comments] Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Ben Blum <bblum@google.com> Cc: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-09-09 18:57:23 -07:00
Peter Oberparleiter	85a0fdfd0f	gcov: fix null-pointer dereference for certain module types The gcov-kernel infrastructure expects that each object file is loaded only once. This may not be true, e.g. when loading multiple kernel modules which are linked to the same object file. As a result, loading such kernel modules will result in incorrect gcov results while unloading will cause a null-pointer dereference. This patch fixes these problems by changing the gcov-kernel infrastructure so that multiple profiling data sets can be associated with one debugfs entry. It applies to 2.6.36-rc1. Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Reported-by: Werner Spies <werner.spies@thalesgroup.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-09-09 18:57:23 -07:00
Peter Zijlstra	4e231c7962	perf: Fix up delayed_put_task_struct() I missed a perf_event_ctxp user when converting it to an array. Pull this last user into perf_event.c as well and fix it up. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 21:07:09 +02:00
Miroslav Lichvar	8af3c153ba	ntp: Clamp PLL update interval Clamp update interval to reduce PLL gain with low sampling rate (e.g. intermittent network connection) to avoid instability. The clamp roughly corresponds to the loop time constant, it's 8 * poll interval for SHIFT_PLL 2 and 32 * poll interval for SHIFT_PLL 4. This gives good results without affecting the gain in normal conditions where ntpd skips only up to seven consecutive samples. Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Acked-by: john stultz <johnstul@us.ibm.com> LKML-Reference: <1283870626-9472-1-git-send-email-mlichvar@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-09-09 20:48:37 +02:00
Peter Zijlstra	1b9a644fec	perf: Optimize context ops Assuming we don't mix events of different pmus onto a single context (with the exeption of software events inside a hardware group) we can now assume that all events on a particular context belong to the same pmu, hence we can disable the pmu for the entire context operations. This reduces the amount of hardware writes. The exception for swevents comes from the fact that the sw pmu disable is a nop. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:34 +02:00
Peter Zijlstra	89a1e18731	perf: Provide a separate task context for swevents Since software events are always schedulable, mixing them up with hardware events (who are not) can lead to funny scheduling oddities. Giving them their own context solves this. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:34 +02:00
Peter Zijlstra	8dc85d5472	perf: Multiple task contexts Provide the infrastructure for multiple task contexts. A more flexible approach would have resulted in more pointer chases in the scheduling hot-paths. This approach has the limitation of a static number of task contexts. Since I expect most external PMUs to be system wide, or at least node wide (as per the intel uncore unit) they won't actually need a task context. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:33 +02:00
Peter Zijlstra	eb18447987	perf: Clean up perf_event_context allocation Unify the two perf_event_context allocation sites. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:33 +02:00
Peter Zijlstra	97dee4f320	perf: Move some code around Move all inherit code near each other. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:33 +02:00
Peter Zijlstra	108b02cfce	perf: Per-pmu-per-cpu contexts Allocate per-cpu contexts per pmu. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:32 +02:00
Peter Zijlstra	b5ab4cd563	perf: Per cpu-context rotation timer Give each cpu-context its own timer so that it is a self contained entity, this eases the way for per-pmu-per-cpu contexts as well as provides the basic infrastructure to allow different rotation times per pmu. Things to look at: - folding the tick and these TICK_NSEC timers - separate task context rotation Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:32 +02:00
Peter Zijlstra	b28ab83c59	perf: Remove the swevent hash-table from the cpu context Separate the swevent hash-table from the cpu_context bits in preparation for per pmu cpu contexts. This keeps the swevent hash a global entity. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:32 +02:00
Peter Zijlstra	c3f00c7027	perf: Separate find_get_context() from event initialization Separate find_get_context() from the event allocation and initialization so that we may make find_get_context() depend on the event pmu in a later patch. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:31 +02:00
Peter Zijlstra	15ac9a395a	perf: Remove the sysfs bits Neither the overcommit nor the reservation sysfs parameter were actually working, remove them as they'll only get in the way. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:31 +02:00
Peter Zijlstra	a4eaf7f146	perf: Rework the PMU methods Replace pmu::{enable,disable,start,stop,unthrottle} with pmu::{add,del,start,stop}, all of which take a flags argument. The new interface extends the capability to stop a counter while keeping it scheduled on the PMU. We replace the throttled state with the generic stopped state. This also allows us to efficiently stop/start counters over certain code paths (like IRQ handlers). It also allows scheduling a counter without it starting, allowing for a generic frozen state (useful for rotating stopped counters). The stopped state is implemented in two different ways, depending on how the architecture implemented the throttled state: 1) We disable the counter: a) the pmu has per-counter enable bits, we flip that b) we program a NOP event, preserving the counter state 2) We store the counter state and ignore all read/overflow events Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: Michael Cree <mcree@orcon.net.nz> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:30 +02:00
Peter Zijlstra	fa407f35e0	perf: Shrink hw_perf_event Use hw_perf_event::period_left instead of hw_perf_event::remaining and win back 8 bytes. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:30 +02:00
Peter Zijlstra	ad5133b703	perf: Default PMU ops Provide default implementations for the pmu txn methods, this allows us to remove some conditional code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: Michael Cree <mcree@orcon.net.nz> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:30 +02:00
Peter Zijlstra	33696fc0d1	perf: Per PMU disable Changes perf_disable() into perf_pmu_disable(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: Michael Cree <mcree@orcon.net.nz> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:29 +02:00
Peter Zijlstra	24cd7f54a0	perf: Reduce perf_disable() usage Since the current perf_disable() usage is only an optimization, remove it for now. This eases the removal of the __weak hw_perf_enable() interface. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: Michael Cree <mcree@orcon.net.nz> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:29 +02:00
Peter Zijlstra	9ed6060d28	perf: Unindent labels Fixup random annoying style bits. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:28 +02:00
Peter Zijlstra	b0a873ebbf	perf: Register PMU implementations Simple registration interface for struct pmu, this provides the infrastructure for removing all the weak functions. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: Michael Cree <mcree@orcon.net.nz> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:28 +02:00
Peter Zijlstra	51b0fe3954	perf: Deconstify struct pmu sed -ie 's/const struct pmu\>/struct pmu/g' `git grep -l "const struct pmu\>"` Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Yanmin <yanmin_zhang@linux.intel.com> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: Michael Cree <mcree@orcon.net.nz> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:46:27 +02:00
Heiko Carstens	01a08546af	sched: Add book scheduling domain On top of the SMT and MC scheduling domains this adds the BOOK scheduling domain. This is useful for NUMA like machines which do not have an interface which tells which piece of memory is attached to which node or where the hardware performs striping. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100831082844.253053798@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:41:20 +02:00
Heiko Carstens	f269893c57	sched: Merge cpu_to_core_group functions Merge and simplify the two cpu_to_core_group variants so that the resulting function follows the same pattern like cpu_to_phys_group. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100831082843.953617555@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:41:18 +02:00
Ingo Molnar	2aa61274ef	Merge branch 'perf/urgent' into perf/core Merge reason: Pick up pending fixes before applying dependent new changes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:40:08 +02:00
Suresh Siddha	da2b71edd8	sched: Move sched_avg_update() to update_cpu_load() Currently sched_avg_update() (which updates rt_avg stats in the rq) is getting called from scale_rt_power() (in the load balance context) which doesn't take rq->lock. Fix it by moving the sched_avg_update() to more appropriate update_cpu_load() where the CFS load gets updated as well. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:39:33 +02:00
Peter Zijlstra	5e11637e2c	perf: Fix CPU hotplug Since we have UP_PREPARE, we should also have UP_CANCELED. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus <paulus@samba.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:38:52 +02:00
Li Zefan	9cb627d5f3	perf, trace: Fix module leak Commit `1c024eca` (perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events) caused a module refcount leak. Reported-And-Tested-by: Avi Kivity <avi@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4C7E1F12.8030304@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-09 20:38:51 +02:00
Linus Torvalds	79637a41e4	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: gcc-4.6: kernel/*: Fix unused but set warnings mutex: Fix annotations to include it in kernel-locking docbook pid: make setpgid() system call use RCU read-side critical section MAINTAINERS: Add RCU's public git tree	2010-09-08 11:13:42 -07:00
Linus Torvalds	899edae615	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf, x86: Try to handle unknown nmis with an enabled PMU perf, x86: Fix handle_irq return values perf, x86: Fix accidentally ack'ing a second event on intel perf counter oprofile, x86: fix init_sysfs() function stub lockup_detector: Sync touch_*_watchdog back to old semantics tracing: Fix a race in function profile oprofile, x86: fix init_sysfs error handling perf_events: Fix time tracking for events with pid != -1 and cpu != -1 perf: Initialize callchains roots's childen hits oprofile: fix crash when accessing freed task structs	2010-09-08 11:13:16 -07:00
Masami Hiramatsu	da34634fd3	tracing/kprobe: Fix handling of C-unlike argument names Check the argument name whether it is invalid (not C-like symbol name). This makes event format simple. Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> LKML-Reference: <20100827113912.22882.62313.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2010-09-08 11:47:19 -03:00
Masami Hiramatsu	aba91595cf	tracing/kprobes: Fix handling of argument names Set "argN" name for each argument automatically if it has no specified name. Since dynamic trace event(kprobe_events) accepts special characters for its argument, its format can show those special characters (e.g. '$', '%', '+'). However, perf can't parse those format because of the character (especially '%') mess up the format. This sets "argX" name for those arguments if user omitted the argument names. E.g. # echo 'p do_fork %ax IP=%ip $stack' > tracing/kprobe_events # cat tracing/kprobe_events p:kprobes/p_do_fork_0 do_fork arg1=%ax IP=%ip arg3=$stack Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> LKML-Reference: <20100827113906.22882.59312.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2010-09-08 11:47:19 -03:00
Masami Hiramatsu	61a5273622	tracing/kprobe: Fix a memory leak in error case Fix a memory leak which happens when a field name conflicts with others. In error case, free_trace_probe() will free all arguments until nr_args, so this increments nr_args the begining of the loop instead of the end. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> LKML-Reference: <20100827113846.22882.12670.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2010-09-08 11:47:18 -03:00
Steven Rostedt	9c55cb12c1	tracing: Do not allow llseek to set_ftrace_filter Reading the file set_ftrace_filter does three things. 1) shows whether or not filters are set for the function tracer 2) shows what functions are set for the function tracer 3) shows what triggers are set on any functions 3 is independent from 1 and 2. The way this file currently works is that it is a state machine, and as you read it, it may change state. But this assumption breaks when you use lseek() on the file. The state machine gets out of sync and the t_show() may use the wrong pointer and cause a kernel oops. Luckily, this will only kill the app that does the lseek, but the app dies while holding a mutex. This prevents anyone else from using the set_ftrace_filter file (or any other function tracing file for that matter). A real fix for this is to rewrite the code, but that is too much for a -rc release or stable. This patch simply disables llseek on the set_ftrace_filter() file for now, and we can do the proper fix for the next major release. Reported-by: Robert Swiecki <swiecki@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Tavis Ormandy <taviso@google.com> Cc: Eugene Teo <eugene@redhat.com> Cc: vendor-sec@lst.de Cc: <stable@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-08 12:08:01 -04:00
Christian Dietrich	ed2d372c07	sched: Remove unnecessary #ifdef CONFIG_SMP The CONFIG_SMP ifdef isn't necessary at this point, because it is checked in an outer ifdef level already and has no effect here. Cleanup only, no functional effect. Signed-off-by: Christian Dietrich <qy03fugy@stud.informatik.uni-erlangen.de> Cc: vamos-dev@i4.informatik.uni-erlangen.de Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Tejun Heo <tj@kernel.org> LKML-Reference: <7a3a39ef3f765a4473cb026b1f204059568a7098.1283782701.git.qy03fugy@stud.informatik.uni-erlangen.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-08 08:14:20 +02:00
Linus Torvalds	cd4d4fc413	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: use zalloc_cpumask_var() for gcwq->mayday_mask workqueue: fix GCWQ_DISASSOCIATED initialization workqueue: Add a workqueue chapter to the tracepoint docbook workqueue: fix cwq->nr_active underflow workqueue: improve destroy_workqueue() debuggability workqueue: mark lock acquisition on worker_maybe_bind_and_lock() workqueue: annotate lock context change workqueue: free rescuer on destroy_workqueue	2010-09-07 14:08:17 -07:00
Andi Kleen	b3bd3de66f	gcc-4.6: kernel/*: Fix unused but set warnings No real bugs I believe, just some dead code. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: andi@firstfloor.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-05 14:36:58 +02:00
Randy Dunlap	ef5dc121d5	mutex: Fix annotations to include it in kernel-locking docbook Fix kernel-doc notation in linux/mutex.h and kernel/mutex.c, then add these 2 files to the kernel-locking docbook as the Mutex API reference chapter. Add one API function to mutex-design.txt and correct a typo in that file. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20100902154816.6cc2f9ad.randy.dunlap@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-03 08:19:51 +02:00
Paul E. McKenney	81a294c44e	rcu: fix _oddness handling of verbose stall warnings CONFIG_RCU_CPU_STALL_VERBOSE depends on CONFIG_TREE_PREEMPT_RCU, but rcu_bootup_announce_oddness() complains if CONFIG_RCU_CPU_STALL_VERBOSE is not set even in the case of CONFIG_TREE_RCU. This commit therefore fixes rcu_bootup_announce_oddness() to avoid insisting on impossibilities. Reported-by: Guy Martin <gmsoft@tuxicoman.be> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-09-02 16:15:30 -07:00
Rabin Vincent	09bfafac3e	ARM: 6314/1: ftrace: allow build without frame pointers on ARM With a new enough GCC, ARM function tracing can be supported without the need for frame pointers. This is essential for Thumb-2 support, since frame pointers aren't available then. Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>	2010-09-02 15:24:53 +01:00
Steven Rostedt	f6195aa09e	ring-buffer: Place duplicate expression into a single function While discussing the strictness of the 80 character limit on the Kernel Summit Discussion mailing list, I showed examples that I broke that limit slightly with some algorithms. In discussing with John Linville, what looked better, I realized that two of the 80 char breaking culprits were an identical expression. As a clean up, this patch moves the identical expression into its own helper function and that is used instead. As a side effect, the offending code is now under the 80 character limit. :-) This clean up code also changes the expression from (A - B) - C to A - (B + C) This makes the code look a little nicer too. Cc: John W. Linville <linville@tuxdriver.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-09-01 12:23:12 -04:00
Don Zickus	68d3f1d810	lockup_detector: Sync touch__watchdog back to old semantics During my rewrite, the semantics of touch_nmi_watchdog and touch_softlockup_watchdog changed enough to break some drivers (mostly over preemptable regions). These are cases where long delays on one CPU (due to print_delay for example) can cause long delays on other CPUs - so we must 'touch' the nmi_watchdog flag of those other CPUs as well. This change brings those touch__watchdog() functions back in line with to how they used to work. Signed-off-by: Don Zickus <dzickus@redhat.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: peterz@infradead.org Cc: fweisbec@gmail.com LKML-Reference: <1283310009-22168-2-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-01 10:02:28 +02:00
Akinobu Mita	14416c35b6	lockup_detector: Remove unused panic_notifier The panic notifer in lockup_detector just set did_panic to 1. But did_panic is not used anywhere so we can just remove it. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: peterz@infradead.org Cc: gorcunov@gmail.com Cc: fweisbec@gmail.com LKML-Reference: <1283310009-22168-4-git-send-email-dzickus@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-01 07:33:34 +02:00
Akinobu Mita	eac243355a	lockup_detector: Convert cpu notifier to return encapsulate errno value By the commit `e6bde73b07` ("cpu-hotplug: return better errno on cpu hotplug failure"), the cpu notifier can return encapsulate errno value, resulting in more meaningful error codes for CPU hotplug failures. This converts the cpu notifier to return encapsulate errno value for the lockup_detector as well. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: peterz@infradead.org Cc: gorcunov@gmail.com Cc: fweisbec@gmail.com LKML-Reference: <1283310009-22168-3-git-send-email-dzickus@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-09-01 07:33:34 +02:00
Paul E. McKenney	950eaaca68	pid: make setpgid() system call use RCU read-side critical section [ 23.584719] [ 23.584720] =================================================== [ 23.585059] [ INFO: suspicious rcu_dereference_check() usage. ] [ 23.585176] --------------------------------------------------- [ 23.585176] kernel/pid.c:419 invoked rcu_dereference_check() without protection! [ 23.585176] [ 23.585176] other info that might help us debug this: [ 23.585176] [ 23.585176] [ 23.585176] rcu_scheduler_active = 1, debug_locks = 1 [ 23.585176] 1 lock held by rc.sysinit/728: [ 23.585176] #0: (tasklist_lock){.+.+..}, at: [<ffffffff8104771f>] sys_setpgid+0x5f/0x193 [ 23.585176] [ 23.585176] stack backtrace: [ 23.585176] Pid: 728, comm: rc.sysinit Not tainted 2.6.36-rc2 #2 [ 23.585176] Call Trace: [ 23.585176] [<ffffffff8105b436>] lockdep_rcu_dereference+0x99/0xa2 [ 23.585176] [<ffffffff8104c324>] find_task_by_pid_ns+0x50/0x6a [ 23.585176] [<ffffffff8104c35b>] find_task_by_vpid+0x1d/0x1f [ 23.585176] [<ffffffff81047727>] sys_setpgid+0x67/0x193 [ 23.585176] [<ffffffff810029eb>] system_call_fastpath+0x16/0x1b [ 24.959669] type=1400 audit(1282938522.956:4): avc: denied { module_request } for pid=766 comm="hwclock" kmod="char-major-10-135" scontext=system_u:system_r:hwclock_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclas It turns out that the setpgid() system call fails to enter an RCU read-side critical section before doing a PID-to-task_struct translation. This commit therefore does rcu_read_lock() before the translation, and also does rcu_read_unlock() after the last use of the returned pointer. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: David Howells <dhowells@redhat.com>	2010-08-31 17:00:18 -07:00
Li Zefan	3aaba20f26	tracing: Fix a race in function profile While we are reading trace_stat/functionX and someone just disabled function_profile at that time, we can trigger this: divide error: 0000 [#1] PREEMPT SMP ... EIP is at function_stat_show+0x90/0x230 ... This fix just takes the ftrace_profile_lock and checks if rec->counter is 0. If it's 0, we know the profile buffer has been reset. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: stable@kernel.org LKML-Reference: <4C723644.4040708@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-08-31 16:46:23 -04:00
Tejun Heo	9c37547ab6	workqueue: use zalloc_cpumask_var() for gcwq->mayday_mask alloc_mayday_mask() was using alloc_cpumask_var() making gcwq->mayday_mask contain garbage after initialization on CONFIG_CPUMASK_OFFSTACK=y configurations. This combined with the previously fixed GCWQ_DISASSOCIATED initialization bug could make rescuers fall into infinite loop trying to bind to an offline cpu. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: CAI Qian <caiqian@redhat.com>	2010-08-31 11:18:34 +02:00
Tejun Heo	477a3c33d1	workqueue: fix GCWQ_DISASSOCIATED initialization init_workqueues() incorrectly marks workqueues for all possible CPUs associated. Combined with mayday_mask initialization bug, this can make rescuers keep trying to bind to an offline gcwq indefinitely. Fix init_workqueues() such that only online CPUs have their gcwqs have GCWQ_DISASSOCIATED cleared. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: CAI Qian <caiqian@redhat.com>	2010-08-31 10:54:35 +02:00
Ingo Molnar	daab7fc734	Merge commit 'v2.6.36-rc3' into x86/memblock Conflicts: arch/x86/kernel/trampoline.c mm/memblock.c Merge reason: Resolve the conflicts, update to latest upstream. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-31 09:45:46 +02:00
Stephane Eranian	fa66f07aa1	perf_events: Fix time tracking for events with pid != -1 and cpu != -1 Per-thread events with a cpu filter, i.e., cpu != -1, were not reporting correct timings when the thread never ran on the monitored cpu. The time enabled was reported as a negative value. This patch fixes the problem by updating tstamp_stopped, tstamp_running in event_sched_out() for events with filters and which are marked as INACTIVE. The function group_sched_out() is modified to systematically call into event_sched_out() to avoid duplicating the timing adjustment code twice. With the patch, I now get: $ task_cpu -i -e unhalted_core_cycles,unhalted_core_cycles noploop 2 noploop for 2 seconds CPU0 0 unhalted_core_cycles (ena=1,991,136,594, run=0) CPU0 0 unhalted_core_cycles (ena=1,991,136,594, run=0) CPU1 0 unhalted_core_cycles (ena=1,991,136,594, run=0) CPU1 0 unhalted_core_cycles (ena=1,991,136,594, run=0) CPU2 0 unhalted_core_cycles (ena=1,991,136,594, run=0) CPU2 0 unhalted_core_cycles (ena=1,991,136,594, run=0) CPU3 4,747,990,931 unhalted_core_cycles (ena=1,991,136,594, run=1,991,136,594) CPU3 4,747,990,931 unhalted_core_cycles (ena=1,991,136,594, run=1,991,136,594) Signed-off-by: Stephane Eranian <eranian@gmail.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus@samba.org Cc: davem@davemloft.net Cc: fweisbec@gmail.com Cc: perfmon2-devel@lists.sf.net Cc: eranian@google.com LKML-Reference: <4c76802d.aae9d80a.115d.70fe@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-30 12:16:55 +02:00
Linus Torvalds	6f4dbeca1a	Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM QoS: Fix inline documentation. PM QoS: Fix kzalloc() parameters swapped in pm_qos_power_open()	2010-08-28 14:06:19 -07:00
Linus Torvalds	2637d139fb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: pxa27x_keypad - remove input_free_device() in pxa27x_keypad_remove() Input: mousedev - fix regression of inverting axes Input: uinput - add devname alias to allow module on-demand load Input: hil_kbd - fix compile error USB: drop tty argument from usb_serial_handle_sysrq_char() Input: sysrq - drop tty argument form handle_sysrq() Input: sysrq - drop tty argument from sysrq ops handlers	2010-08-28 13:55:31 -07:00
Yinghai Lu	a587d2daeb	x86: Remove not used early_res code and some functions in e820.c that are not used anymore Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-08-27 11:13:51 -07:00
Paul E. McKenney	dd7c4d8973	rcu: performance fixes to TINY_PREEMPT_RCU callback checking This commit tightens up checks in rcu_preempt_check_callbacks() to avoid unnecessary special handling at rcu_read_unlock() time. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-27 10:51:17 -07:00
Frederic Weisbecker	98ee74a75c	Merge branch 'perf/urgent' into perf/core Conflicts: tools/perf/util/callchain.h Merge reason: Fix a non-trivial conflict with latest fixes	2010-08-27 02:30:07 +02:00
Saravana Kannan	25cc69ec34	PM QoS: Fix inline documentation. Fix the pm_qos_add_request() kerneldoc comment that doesn't reflect the behavior of the function after the last PM QoS update. Signed-off-by: Saravana Kannan <skannan@codeaurora.org> Acked-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-08-26 20:18:43 +02:00
Linus Torvalds	d4348c6789	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf, x86, Pentium4: Clear the P4_CCCR_FORCE_OVF flag tracing/trace_stack: Fix stack trace on ppc64	2010-08-25 10:50:07 -07:00
Linus Torvalds	5e686019df	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, tsc, sched: Recompute cyc2ns_offset's during resume from sleep states sched: Fix rq->clock synchronization when migrating tasks	2010-08-25 08:40:56 -07:00
Ingo Molnar	7de5d895b2	Merge branch 'linus' into perf/core Merge reason: pick up perf fixes Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-25 13:10:00 +02:00
Anton Blanchard	151772dbfa	tracing/trace_stack: Fix stack trace on ppc64 save_stack_trace() stores the instruction pointer, not the function descriptor. On ppc64 the trace stack code currently dereferences the instruction pointer and shows 8 bytes of instructions in our backtraces: # cat /sys/kernel/debug/tracing/stack_trace Depth Size Location (26 entries) ----- ---- -------- 0) 5424 112 0x6000000048000004 1) 5312 160 0x60000000ebad01b0 2) 5152 160 0x2c23000041c20030 3) 4992 240 0x600000007c781b79 4) 4752 160 0xe84100284800000c 5) 4592 192 0x600000002fa30000 6) 4400 256 0x7f1800347b7407e0 7) 4144 208 0xe89f0108f87f0070 8) 3936 272 0xe84100282fa30000 Since we aren't dealing with function descriptors, use %pS instead of %pF to fix it: # cat /sys/kernel/debug/tracing/stack_trace Depth Size Location (26 entries) ----- ---- -------- 0) 5424 112 ftrace_call+0x4/0x8 1) 5312 160 .current_io_context+0x28/0x74 2) 5152 160 .get_io_context+0x48/0xa0 3) 4992 240 .cfq_set_request+0x94/0x4c4 4) 4752 160 .elv_set_request+0x60/0x84 5) 4592 192 .get_request+0x2d4/0x468 6) 4400 256 .get_request_wait+0x7c/0x258 7) 4144 208 .__make_request+0x49c/0x610 8) 3936 272 .generic_make_request+0x390/0x434 Signed-off-by: Anton Blanchard <anton@samba.org> Cc: rostedt@goodmis.org Cc: fweisbec@gmail.com LKML-Reference: <20100825013238.GE28360@kryten> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-25 13:08:48 +02:00
Tejun Heo	8a2e8e5dec	workqueue: fix cwq->nr_active underflow cwq->nr_active is used to keep track of how many work items are active for the cpu workqueue, where 'active' is defined as either pending on global worklist or executing. This is used to implement the max_active limit and workqueue freezing. If a work item is queued after nr_active has already reached max_active, the work item doesn't increment nr_active and is put on the delayed queue and gets activated later as previous active work items retire. try_to_grab_pending() which is used in the cancellation path unconditionally decremented nr_active whether the work item being cancelled is currently active or delayed, so cancelling a delayed work item makes nr_active underflow. This breaks max_active enforcement and triggers BUG_ON() in destroy_workqueue() later on. This patch fixes this bug by adding a flag WORK_STRUCT_DELAYED, which is set while a work item in on the delayed list and making try_to_grab_pending() decrement nr_active iff the work item is currently active. The addition of the flag enlarges cwq alignment to 256 bytes which is getting a bit too large. It's scheduled to be reduced back to 128 bytes by merging WORK_STRUCT_PENDING and WORK_STRUCT_CWQ in the next devel cycle. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Johannes Berg <johannes@sipsolutions.net>	2010-08-25 10:33:56 +02:00
Linus Torvalds	502adf5778	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: watchdog: Don't throttle the watchdog tracing: Fix timer tracing	2010-08-24 12:21:49 -07:00
Linus Torvalds	3b6c5507a6	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: mutex: Improve the scalability of optimistic spinning	2010-08-24 12:21:02 -07:00
David Alan Gilbert	bac1e74dba	PM QoS: Fix kzalloc() parameters swapped in pm_qos_power_open() sparse spotted that the kzalloc() in pm_qos_power_open() in the current Linus' git tree had its parameters swapped. Fix this. Signed-off-by: David Alan Gilbert <linux@treblig.org> Acked-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-08-24 20:22:18 +02:00
Tejun Heo	e41e704bc4	workqueue: improve destroy_workqueue() debuggability Now that the worklist is global, having works pending after wq destruction can easily lead to oops and destroy_workqueue() have several BUG_ON()s to catch these cases. Unfortunately, BUG_ON() doesn't tell much about how the work became pending after the final flush_workqueue(). This patch adds WQ_DYING which is set before the final flush begins. If a work is requested to be queued on a dying workqueue, WARN_ON_ONCE() is triggered and the request is ignored. This clearly indicates which caller is trying to queue a work on a dying workqueue and keeps the system working in most cases. Locking rule comment is updated such that the 'I' rule includes modifying the field from destruction path. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-08-24 18:01:32 +02:00
Namhyung Kim	972fa1c531	workqueue: mark lock acquisition on worker_maybe_bind_and_lock() worker_maybe_bind_and_lock() actually grabs gcwq->lock but was missing proper annotation. Add it. So this patch will remove following sparse warnings: kernel/workqueue.c:1214:13: warning: context imbalance in 'worker_maybe_bind_and_lock' - wrong count at exit arch/x86/include/asm/irqflags.h:44:9: warning: context imbalance in 'worker_rebind_fn' - unexpected unlock kernel/workqueue.c:1991:17: warning: context imbalance in 'rescuer_thread' - unexpected unlock Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-08-23 11:37:49 +02:00
Namhyung Kim	06bd6ebffa	workqueue: annotate lock context change Some of internal functions called within gcwq->lock context releases and regrabs the lock but were missing proper annotations. Add it. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-08-23 11:37:49 +02:00
Ingo Molnar	a6b9b4d50f	Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu	2010-08-23 11:32:34 +02:00
Tim Chen	9d0f4dcc5c	mutex: Improve the scalability of optimistic spinning There is a scalability issue for current implementation of optimistic mutex spin in the kernel. It is found on a 8 node 64 core Nehalem-EX system (HT mode). The intention of the optimistic mutex spin is to busy wait and spin on a mutex if the owner of the mutex is running, in the hope that the mutex will be released soon and be acquired, without the thread trying to acquire mutex going to sleep. However, when we have a large number of threads, contending for the mutex, we could have the mutex grabbed by other thread, and then another ……, and we will keep spinning, wasting cpu cycles and adding to the contention. One possible fix is to quit spinning and put the current thread on wait-list if mutex lock switch to a new owner while we spin, indicating heavy contention (see the patch included). I did some testing on a 8 socket Nehalem-EX system with a total of 64 cores. Using Ingo's test-mutex program that creates/delete files with 256 threads (http://lkml.org/lkml/2006/1/8/50) , I see the following speed up after putting in the mutex spin fix: ./mutex-test V 256 10 Ops/sec 2.6.34 62864 With fix 197200 Repeating the test with Aim7 fserver workload, again there is a speed up with the fix: Jobs/min 2.6.34 91657 With fix 149325 To look at the impact on the distribution of mutex acquisition time, I collected the mutex acquisition time on Aim7 fserver workload with some instrumentation. The average acquisition time is reduced by 48% and number of contentions reduced by 32%. #contentions Time to acquire mutex (cycles) 2.6.34 72973 44765791 With fix 49210 23067129 The histogram of mutex acquisition time is listed below. The acquisition time is in 2^bin cycles. We see that without the fix, the acquisition time is mostly around 2^26 cycles. With the fix, we the distribution get spread out a lot more towards the lower cycles, starting from 2^13. However, there is an increase of the tail distribution with the fix at 2^28 and 2^29 cycles. It seems a small price to pay for the reduced average acquisition time and also getting the cpu to do useful work. Mutex acquisition time distribution (acq time = 2^bin cycles): 2.6.34 With Fix bin #occurrence % #occurrence % 11 2 0.00% 120 0.24% 12 10 0.01% 790 1.61% 13 14 0.02% 2058 4.18% 14 86 0.12% 3378 6.86% 15 393 0.54% 4831 9.82% 16 710 0.97% 4893 9.94% 17 815 1.12% 4667 9.48% 18 790 1.08% 5147 10.46% 19 580 0.80% 6250 12.70% 20 429 0.59% 6870 13.96% 21 311 0.43% 1809 3.68% 22 255 0.35% 2305 4.68% 23 317 0.44% 916 1.86% 24 610 0.84% 233 0.47% 25 3128 4.29% 95 0.19% 26 63902 87.69% 122 0.25% 27 619 0.85% 286 0.58% 28 0 0.00% 3536 7.19% 29 0 0.00% 903 1.83% 30 0 0.00% 0 0.00% I've done similar experiments with 2.6.35 kernel on smaller boxes as well. One is on a dual-socket Westmere box (12 cores total, with HT). Another experiment is on an old dual-socket Core 2 box (4 cores total, no HT) On the 12-core Westmere box, I see a 250% increase for Ingo's mutex-test program with my mutex patch but no significant difference in aim7's fserver workload. On the 4-core Core 2 box, I see the difference with the patch for both mutex-test and aim7 fserver are negligible. So far, it seems like the patch has not caused regression on smaller systems. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> # .35.x LKML-Reference: <1282168827.9542.72.camel@schen9-DESK> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-23 10:56:27 +02:00
Peter Zijlstra	c6db67cda7	watchdog: Don't throttle the watchdog Stephane reported that when the machine locks up, the regular ticks, which are responsible to resetting the throttle count, stop too. Hence the NMI watchdog can end up being throttled before it reports on the locked up state, and we end up being sad.. Cure this by having the watchdog overflow reset its own throttle count. Reported-by: Stephane Eranian <eranian@google.com> Tested-by: Stephane Eranian <eranian@google.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1282215916.1926.4696.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-23 10:48:05 +02:00
Arjan van de Ven	e36c886a0f	workqueue: Add basic tracepoints to track workqueue execution With the introduction of the new unified work queue thread pools, we lost one feature: It's no longer possible to know which worker is causing the CPU to wake out of idle. The result is that PowerTOP now reports a lot of "kworker/a:b" instead of more readable results. This patch adds a pair of tracepoints to the new workqueue code, similar in style to the timer/hrtimer tracepoints. With this pair of tracepoints, the next PowerTOP can correctly report which work item caused the wakeup (and how long it took): Interrupt (43) i915 time 3.51ms wakeups 141 Work ieee80211_iface_work time 0.81ms wakeups 29 Work do_dbs_timer time 0.55ms wakeups 24 Process Xorg time 21.36ms wakeups 4 Timer sched_rt_period_timer time 0.01ms wakeups 1 Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-21 13:19:37 -07:00
Linus Torvalds	297c5eee37	mm: make the vma list be doubly linked It's a really simple list, and several of the users want to go backwards in it to find the previous vma. So rather than have to look up the previous entry with 'find_vma_prev()' or something similar, just make it doubly linked instead. Tested-by: Ian Campbell <ijc@hellion.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-21 08:49:21 -07:00
Dmitry Torokhov	f335397d17	Input: sysrq - drop tty argument form handle_sysrq() Sysrq operations do not accept tty argument anymore so no need to pass it to us. [Stephen Rothwell <sfr@canb.auug.org.au>: fix build breakage in drm code caused by sysrq using bool but not including linux/types.h] [Sachin Sant <sachinp@in.ibm.com>: fix build breakage in s390 keyboadr driver] Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Acked-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Dmitry Torokhov <dtor@mail.ru>	2010-08-21 00:34:45 -07:00
Andrea Righi	b35de43b31	kfifo: implement missing __kfifo_skip_r() kfifo_skip() is currently broken, due to the missing of the internal helper function. Add it. Signed-off-by: Andrea Righi <arighi@develer.com> Cc: Greg KH <greg@kroah.com> Acked-by: Stefani Seibold <stefani@seibold.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-20 09:34:54 -07:00
Paul E. McKenney	80dcf60e6b	rcu: apply TINY_PREEMPT_RCU read-side speedup to TREE_PREEMPT_RCU Replace one of the ACCESS_ONCE() calls in each of __rcu_read_lock() and __rcu_read_unlock() with barrier() as suggested by Steve Rostedt in order to avoid the potential compiler-optimization-induced bug noted by Mathieu Desnoyers. Located-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-20 09:00:17 -07:00
Paul E. McKenney	7b0b759b65	rcu: combine duplicate code, courtesy of CONFIG_PREEMPT_RCU The CONFIG_PREEMPT_RCU kernel configuration parameter was recently re-introduced, but as an indication of the type of RCU (preemptible vs. non-preemptible) instead of as selecting a given implementation. This commit uses CONFIG_PREEMPT_RCU to combine duplicate code from include/linux/rcutiny.h and include/linux/rcutree.h into include/linux/rcupdate.h. This commit also combines a few other pieces of duplicate code that have accumulated. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-20 09:00:16 -07:00
Paul E. McKenney	a3dc3fb161	rcu: repair code-duplication FIXMEs Combine the duplicate definitions of ULONG_CMP_GE(), ULONG_CMP_LT(), and rcu_preempt_depth() into include/linux/rcupdate.h. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-20 09:00:13 -07:00
Paul E. McKenney	53d84e004d	rcu: permit suppressing current grace period's CPU stall warnings When using a kernel debugger, a long sojourn in the debugger can get you lots of RCU CPU stall warnings once you resume. This might not be helpful, especially if you are using the system console. This patch therefore allows RCU CPU stall warnings to be suppressed, but only for the duration of the current set of grace periods. This differs from Jason's original patch in that it adds support for tiny RCU and preemptible RCU, and uses a slightly different method for suppressing the RCU CPU stall warning messages. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Jason Wessel <jason.wessel@windriver.com>	2010-08-20 09:00:12 -07:00
Paul E. McKenney	8cdd32a918	rcu: refer RCU CPU stall-warning victims to stallwarn.txt There is some documentation on RCU CPU stall warnings contained in Documentation/RCU/stallwarn.txt, but it will not be apparent to someone who runs into such a warning while under time pressure. This commit therefore adds comments preceding the printk()s pointing out the location of this documentation. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-08-20 09:00:11 -07:00

... 3 4 5 6 7 ...

10668 Commits