Commit Graph

8405 Commits

Author SHA1 Message Date
Steven Rostedt
c7b0930857 ring-buffer: prevent adding write in discarded area
This a very tight race where an interrupt could come in and not
have enough data to put into the end of a buffer page, and that
it would fail to write and need to go to the next page.

But if this happened when another writer was about to reserver
their data, and that writer has smaller data to reserve, then
it could succeed even though the interrupt moved the tail page.

To pervent that, if we fail to store data, and by subtracting the
amount we reserved we still have room for smaller data, we need
to fill that space with "discarded" data.

[ Impact: prevent race were buffer data may be lost ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-15 11:37:19 -04:00
Li Zefan
0ac2058f68 tracing/filters: strloc should be unsigned short
I forgot to update filter code accordingly in
"tracing/events: change the type of __str_loc_item to unsigned short"
(commt b0aae68cc5)

It can cause system crash:

 # echo 1 > tracing/events/irq/irq_handler_entry/enable
 # echo 'name == eth0' > tracing/events/irq/irq_handler_entry/filter

[ Impact: fix crash while filtering on __string() field ]

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A35B905.3090500@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-15 11:37:18 -04:00
Li Zefan
5e4904cb63 tracing/filters: operand can be negative
This should be a bug:

 # cat format
 name: foo_bar
 ID: 71
 format:
	 ...
         field:int bar;  offset:24;      size:4;
 # echo 'bar < 0' > filter
 # echo 'bar < -1' > filter
 bash: echo: write error: Invalid argument

[ Impact: fix to allow negative operand in filer expr ]

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A35B8DF.60400@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-15 11:37:16 -04:00
Li Zefan
e4f2d10f47 tracing: replace a GFP_ATOMIC with GFP_KERNEL allocation
Atomic allocation is not needed here.

[ Impact: clean up of memory alloction type ]

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A35B898.2050607@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-15 11:37:14 -04:00
Li Zefan
215368e8e5 tracing: fix a typo in tracing_cpumask_write()
It's tracing_cpumask_new that should be kfree()ed.

This causes tracing_cpumask to be freed due to the typo:

 # echo z > tracing_cpumask
 bash: echo: write error: Invalid argument

And subsequent reads/writes to tracing_cpuamsk will access this
already-freed tracing_cpumask, thus may lead to crash.

[ Impact: fix leak and crash when writing invalid val to tracing_cpumask ]

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A35B86A.7070608@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-15 11:37:12 -04:00
Rusty Russell
3f237a79dd cpumask: use new operators in kernel/trace
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <200906122115.30787.rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-15 11:36:42 -04:00
Lennart Poettering
ca94c44253 sched: Introduce SCHED_RESET_ON_FORK scheduling policy flag
This patch introduces a new flag SCHED_RESET_ON_FORK which can be passed
to the kernel via sched_setscheduler(), ORed in the policy parameter. If
set this will make sure that when the process forks a) the scheduling
priority is reset to DEFAULT_PRIO if it was higher and b) the scheduling
policy is reset to SCHED_NORMAL if it was either SCHED_FIFO or SCHED_RR.

Why have this?

Currently, if a process is real-time scheduled this will 'leak' to all
its child processes. For security reasons it is often (always?) a good
idea to make sure that if a process acquires RT scheduling this is
confined to this process and only this process. More specifically this
makes the per-process resource limit RLIMIT_RTTIME useful for security
purposes, because it makes it impossible to use a fork bomb to
circumvent the per-process RLIMIT_RTTIME accounting.

This feature is also useful for tools like 'renice' which can then
change the nice level of a process without having this spill to all its
child processes.

Why expose this via sched_setscheduler() and not other syscalls such as
prctl() or sched_setparam()?

prctl() does not take a pid parameter. Due to that it would be
impossible to modify this flag for other processes than the current one.

The struct passed to sched_setparam() can unfortunately not be extended
without breaking compatibility, since sched_setparam() lacks a size
parameter.

How to use this from userspace? In your RT program simply replace this:

  sched_setscheduler(pid, SCHED_FIFO, &param);

by this:

  sched_setscheduler(pid, SCHED_FIFO|SCHED_RESET_ON_FORK, &param);

Signed-off-by: Lennart Poettering <lennart@poettering.net>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090615152714.GA29092@tango.0pointer.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-15 17:31:59 +02:00
Peter Zijlstra
75f937f24b perf_counter: Fix ctx->mutex vs counter->mutex inversion
Simon triggered a lockdep inversion report about us taking ctx->mutex
vs counter->mutex in inverse orders. Fix that up.

Reported-by: Simon Holm Thøgersen <odie@cs.aau.dk>
Tested-by: Simon Holm Thøgersen <odie@cs.aau.dk>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-15 15:57:49 +02:00
Vegard Nossum
722f2a6c87 Merge commit 'linus/master' into HEAD
Conflicts:
	MAINTAINERS

Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
2009-06-15 15:50:49 +02:00
Vegard Nossum
7a0aeb14e1 signal: fix __send_signal() false positive kmemcheck warning
This false positive is due to field padding in struct sigqueue. When
this dynamically allocated structure is copied to the stack (in arch-
specific delivery code), kmemcheck sees a read from the padding, which
is, naturally, uninitialized.

Hide the false positive using the __GFP_NOTRACK_FALSE_POSITIVE flag.
Also made the rlimit override code a bit clearer by introducing a new
variable.

Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
2009-06-15 15:49:43 +02:00
Vegard Nossum
1744a21d57 trace: annotate bitfields in struct ring_buffer_event
This gets rid of a heap of false-positive warnings from the tracer
code due to the use of bitfields.

[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
2009-06-15 15:49:37 +02:00
Vegard Nossum
2dff440525 kmemcheck: add mm functions
With kmemcheck enabled, the slab allocator needs to do this:

1. Tell kmemcheck to allocate the shadow memory which stores the status of
   each byte in the allocation proper, e.g. whether it is initialized or
   uninitialized.
2. Tell kmemcheck which parts of memory that should be marked uninitialized.
   There are actually a few more states, such as "not yet allocated" and
   "recently freed".

If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
memory that can take page faults because of kmemcheck.

If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
request memory with the __GFP_NOTRACK flag. This does not prevent the page
faults from occuring, however, but marks the object in question as being
initialized so that no warnings will ever be produced for this object.

In addition to (and in contrast to) __GFP_NOTRACK, the
__GFP_NOTRACK_FALSE_POSITIVE flag indicates that the allocation should
not be tracked _because_ it would produce a false positive. Their values
are identical, but need not be so in the future (for example, we could now
enable/disable false positives with a config option).

Parts of this patch were contributed by Pekka Enberg but merged for
atomicity.

Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
2009-06-15 12:40:03 +02:00
Linus Torvalds
45e3e1935e Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-next
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-next: (53 commits)
  .gitignore: ignore *.lzma files
  kbuild: add generic --set-str option to scripts/config
  kbuild: simplify argument loop in scripts/config
  kbuild: handle non-existing options in scripts/config
  kallsyms: generalize text region handling
  kallsyms: support kernel symbols in Blackfin on-chip memory
  documentation: make version fix
  kbuild: fix a compile warning
  gitignore: Add GNU GLOBAL files to top .gitignore
  kbuild: fix delay in setlocalversion on readonly source
  README: fix misleading pointer to the defconf directory
  vmlinux.lds.h update
  kernel-doc: cleanup perl script
  Improve vmlinux.lds.h support for arch specific linker scripts
  kbuild: fix headers_exports with boolean expression
  kbuild/headers_check: refine extern check
  kbuild: fix "Argument list too long" error for "make headers_check",
  ignore *.patch files
  Remove bashisms from scripts
  menu: fix embedded menu presentation
  ...
2009-06-14 14:12:18 -07:00
Linus Torvalds
489f7ab6c1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (31 commits)
  trivial: remove the trivial patch monkey's name from SubmittingPatches
  trivial: Fix a typo in comment of addrconf_dad_start()
  trivial: usb: fix missing space typo in doc
  trivial: pci hotplug: adding __init/__exit macros to sgi_hotplug
  trivial: Remove the hyphen from git commands
  trivial: fix ETIMEOUT -> ETIMEDOUT typos
  trivial: Kconfig: .ko is normally not included in module names
  trivial: SubmittingPatches: fix typo
  trivial: Documentation/dell_rbu.txt: fix typos
  trivial: Fix Pavel's address in MAINTAINERS
  trivial: ftrace:fix description of trace directory
  trivial: unnecessary (void*) cast removal in sound/oss/msnd.c
  trivial: input/misc: Fix typo in Kconfig
  trivial: fix grammo in bus_for_each_dev() kerneldoc
  trivial: rbtree.txt: fix rb_entry() parameters in sample code
  trivial: spelling fix in ppc code comments
  trivial: fix typo in bio_alloc kernel doc
  trivial: Documentation/rbtree.txt: cleanup kerneldoc of rbtree.txt
  trivial: Miscellaneous documentation typo fixes
  trivial: fix typo milisecond/millisecond for documentation and source comments.
  ...
2009-06-14 13:46:25 -07:00
Linus Torvalds
a2ee2981ae Merge branch 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (80 commits)
  x86, mce: Add boot options for corrected errors
  x86, mce: Fix mce printing
  x86, mce: fix for mce counters
  x86, mce: support action-optional machine checks
  x86, mce: define MCE_VECTOR
  x86, mce: rename mce_notify_user to mce_notify_irq
  x86: fix panic with interrupts off (needed for MCE)
  x86, mce: export MCE severities coverage via debugfs
  x86, mce: implement new status bits
  x86, mce: print header/footer only once for multiple MCEs
  x86, mce: default to panic timeout for machine checks
  x86, mce: improve mce_get_rip
  x86, mce: make non Monarch panic message "Fatal machine check" too
  x86, mce: switch x86 machine check handler to Monarch election.
  x86, mce: implement panic synchronization
  x86, mce: implement bootstrapping for machine check wakeups
  x86, mce: check early in exception handler if panic is needed
  x86, mce: add table driven machine check grading
  x86, mce: remove TSC print heuristic
  x86, mce: log corrected errors when panicing
  ...
2009-06-13 13:14:51 -07:00
Vegard Nossum
dfec072ecd kmemcheck: add the kmemcheck core
General description: kmemcheck is a patch to the linux kernel that
detects use of uninitialized memory. It does this by trapping every
read and write to memory that was allocated dynamically (e.g. using
kmalloc()). If a memory address is read that has not previously been
written to, a message is printed to the kernel log.

Thanks to Andi Kleen for the set_memory_4k() solution.

Andrew Morton suggested documenting the shadow member of struct page.

Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>

[export kmemcheck_mark_initialized]
[build fix for setup_max_cpus]
Signed-off-by: Ingo Molnar <mingo@elte.hu>

[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
2009-06-13 15:37:30 +02:00
Marti Raudsepp
d5e8da6449 perf_counter: Fix stack corruption in perf_read_hw
With PERF_FORMAT_ID, perf_read_hw now needs space for up to 4 values.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-13 12:58:24 +02:00
Paul Mackerras
87847b8f26 perf_counter: Fix atomic_set vs. atomic64_t type mismatch
Using atomic_set on an atomic64_t variable gives a compiler
warning on powerpc, and won't give the desired result at runtime.
This fixes an instance of this error in the perf_counter code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <18995.20490.979429.244883@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-13 12:58:24 +02:00
Thomas Gleixner
cd6d95d844 clocksource: prevent selection of low resolution clocksourse also for nohz=on
commit 3f68535ada (clocksource: sanity check sysfs clocksource
changes) prevents selection of non high resolution capable
clocksources when high resolution mode is active, but did not take
into account that the same rules apply for highres=off nohz=on.

Check the tick device mode instead of hrtimer_hres_active() to verify
whether the system needs to be protected from a switch to jiffies or
other non highres capable clock sources.

Reported-by: Luming Yu <luming.yu@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-06-13 12:00:26 +02:00
Vegard Nossum
7c692cbade tasklets: new tasklet scheduling function
Rationale: kmemcheck needs to be able to schedule a tasklet without
touching any dynamically allocated memory _at_ _all_ (since that would
lead to a recursive page fault). This tasklet is used for writing the
error reports to the kernel log.

The new scheduling function avoids touching any other tasklets by
inserting the new tasklist as the head of the "tasklet_hi" list instead
of on the tail.

Also don't wake up the softirq thread lest the scheduler access some
tracked memory and we go down with a recursive page fault.

In this case, we'd better just wait for the maximum time of 1/HZ for the
message to appear.

Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
2009-06-13 10:02:24 +02:00
Linus Torvalds
947ec0b0c1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM: Add empty suspend/resume device irq functions
  PM/Hibernate: Move NVS routines into a seperate file (v2).
  PM/Hibernate: Rename disk.c to hibernate.c
  PM: Separate suspend to RAM functionality from core
  Driver Core: Rework platform suspend/resume, print warning
  PM: Remove device_type suspend()/resume()
  PM/Hibernate: Move memory shrinking to snapshot.c (rev. 2)
  PM/Suspend: Do not shrink memory before suspend
  PM: Remove bus_type suspend_late()/resume_early() V2
  PM core: rename suspend and resume functions
  PM: Rename device_power_down/up()
  PM: Remove unused asm/suspend.h
  x86: unify power/cpu_(32|64).c
  x86: unify power/cpu_(32|64) copyright notes
  x86: unify power/cpu_(32|64) regarding restoring processor state
  x86: unify power/cpu_(32|64) regarding saving processor state
  x86: unify power/cpu_(32|64) global variables
  x86: unify power/cpu_(32|64) headers
  PM: Warn if interrupts are enabled during suspend-resume of sysdevs
  PM/ACPI/x86: Fix sparse warning in arch/x86/kernel/acpi/sleep.c
2009-06-12 13:17:27 -07:00
Linus Torvalds
4ddbac9898 Merge branch 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf_counter: Start documenting HAVE_PERF_COUNTERS requirements
  perf_counter: Add forward/backward attribute ABI compatibility
  perf record: Explicity program a default counter
  perf_counter: Remove PERF_TYPE_RAW special casing
  perf_counter: PERF_TYPE_HW_CACHE is a hardware counter too
  powerpc, perf_counter: Fix performance counter event types
  perf_counter/x86: Add a quirk for Atom processors
  perf_counter tools: Remove one L1-data alias
2009-06-12 13:16:52 -07:00
Cornelia Huck
fce2b111fa PM/Hibernate: Move NVS routines into a seperate file (v2).
The *_nvs_* routines in swsusp.c make use of the io*map()
functions, which are only provided for HAS_IOMEM, thus
breaking compilation if HAS_IOMEM is not set. Fix this
by moving the *_nvs_* routines into hibernate_nvs.c, which
is only compiled if HAS_IOMEM is set.

[rjw: Change the name of the new file to hibernate_nvs.c, add the
 license line to the header comment.]

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-06-12 21:32:33 +02:00
Rafael J. Wysocki
8b759b84c8 PM/Hibernate: Rename disk.c to hibernate.c
Change the name of kernel/power/disk.c to kernel/power/hibernate.c
in analogy with the file names introduced by the changes that
separated the suspend to RAM and standby funtionality from the
common PM functions.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
2009-06-12 21:32:33 +02:00
Rafael J. Wysocki
a9d7052363 PM: Separate suspend to RAM functionality from core
Move the suspend to RAM and standby code from kernel/power/main.c
to two separate files, kernel/power/suspend.c containing the basic
functions and kernel/power/suspend_test.c containing the automatic
suspend test facility based on the RTC clock alarm.

There are no changes in functionality related to these modifications.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
2009-06-12 21:32:33 +02:00
Rafael J. Wysocki
fe419535d8 PM/Hibernate: Move memory shrinking to snapshot.c (rev. 2)
A future patch is going to modify the memory shrinking code so that
it will make memory allocations to free memory instead of using an
artificial memory shrinking mechanism for that.  For this purpose it
is convenient to move swsusp_shrink_memory() from
kernel/power/swsusp.c to kernel/power/snapshot.c, because the new
memory-shrinking code is going to use things that are local to
kernel/power/snapshot.c .

[rev. 2: Make some functions static and remove their headers from
 kernel/power/power.h]

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
2009-06-12 21:32:32 +02:00
Rafael J. Wysocki
c6f37f1219 PM/Suspend: Do not shrink memory before suspend
Remove the shrinking of memory from the suspend-to-RAM code, where
it is not really necessary.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@tuxonice.net>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
2009-06-12 21:32:32 +02:00
Alan Stern
d161630297 PM core: rename suspend and resume functions
This patch (as1241) renames a bunch of functions in the PM core.
Rather than go through a boring list of name changes, suffice it to
say that in the end we have a bunch of pairs of functions:

	device_resume_noirq	dpm_resume_noirq
	device_resume		dpm_resume
	device_complete		dpm_complete
	device_suspend_noirq	dpm_suspend_noirq
	device_suspend		dpm_suspend
	device_prepare		dpm_prepare

in which device_X does the X operation on a single device and dpm_X
invokes device_X for all devices in the dpm_list.

In addition, the old dpm_power_up and device_resume_noirq have been
combined into a single function (dpm_resume_noirq).

Lastly, dpm_suspend_start and dpm_resume_end are the renamed versions
of the former top-level device_suspend and device_resume routines.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Magnus Damm <damm@igel.co.jp>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-06-12 21:32:31 +02:00
Magnus Damm
e39a71ef80 PM: Rename device_power_down/up()
Rename the functions performing "_noirq" dev_pm_ops
operations from device_power_down() and device_power_up()
to device_suspend_noirq() and device_resume_noirq().

The new function names are chosen to show that the functions
are responsible for calling the _noirq() versions to finalize
the suspend/resume operation. The current function names do
not perform power down/up anymore so the names may be misleading.

Global function renames:
- device_power_down() -> device_suspend_noirq()
- device_power_up() -> device_resume_noirq()

Static function renames:
- suspend_device_noirq() -> __device_suspend_noirq()
- resume_device_noirq() -> __device_resume_noirq()

Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Len Brown <lenb@kernel.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-06-12 21:32:31 +02:00
Linus Torvalds
6d21491838 Merge branch 'topic/slab/earlyboot-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'topic/slab/earlyboot-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slab: setup cpu caches later on when interrupts are enabled
  slab,slub: don't enable interrupts during early boot
  slab: fix gfp flag in setup_cpu_cache()
  x86: make zap_low_mapping could be used early
  irq: slab alloc for default irq_affinity
  memcg: fix page_cgroup fatal error in FLATMEM
2009-06-12 09:52:30 -07:00
Linus Torvalds
7f3591cfac Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest: (31 commits)
  lguest: add support for indirect ring entries
  lguest: suppress notifications in example Launcher
  lguest: try to batch interrupts on network receive
  lguest: avoid sending interrupts to Guest when no activity occurs.
  lguest: implement deferred interrupts in example Launcher
  lguest: remove obsolete LHREQ_BREAK call
  lguest: have example Launcher service all devices in separate threads
  lguest: use eventfds for device notification
  eventfd: export eventfd_signal and eventfd_fget for lguest
  lguest: allow any process to send interrupts
  lguest: PAE fixes
  lguest: PAE support
  lguest: Add support for kvm_hypercall4()
  lguest: replace hypercall name LHCALL_SET_PMD with LHCALL_SET_PGD
  lguest: use native_set_* macros, which properly handle 64-bit entries when PAE is activated
  lguest: map switcher with executable page table entries
  lguest: fix writev returning short on console output
  lguest: clean up length-used value in example launcher
  lguest: Segment selectors are 16-bit long. Fix lg_cpu.ss1 definition.
  lguest: beyond ARRAY_SIZE of cpu->arch.gdt
  ...
2009-06-12 09:32:26 -07:00
Jean Delvare
3ac49a1c99 trivial: fix ETIMEOUT -> ETIMEDOUT typos
fix ETIMEOUT -> ETIMEDOUT typos

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-06-12 18:01:50 +02:00
Manish Katiyar
1dc492a0a4 trivial: kernel/power/poweroff.c: whitespace fix
Fix coding style whitespace fixes. Patch compile tested
Before :-
total: 1 errors, 0 warnings, 46 lines checked
After
total: 0 errors, 0 warnings, 46 lines checked

Before :-
  text	   data	    bss	    dec	    hex	filename
    107	     48	      0	    155	     9b	kernel/power/poweroff.o
After
   text	   data	    bss	    dec	    hex	filename
    107	     48	      0	    155	     9b	kernel/power/poweroff.o

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-06-12 18:01:46 +02:00
Rusty Russell
b43e352139 sched: export kick_process
lguest needs kick_process: wake_up_process() does nothing if a process
is running, which isn't sufficient (we need it in the kernel).

And lguest support is usually modular.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
2009-06-12 22:27:01 +09:30
Peter Zijlstra
974802eaa1 perf_counter: Add forward/backward attribute ABI compatibility
Provide for means of extending the perf_counter_attr in a 'natural' way.

We allow growing the structure by appending fields at the end by specifying
the full structure size inside it.

When a new kernel sees a smaller (old) structure, it will 0 pad the tail.
When an old kernel sees a larger (new) structure, it will verify the tail
consists of 0s, otherwise fail.

If we fail due to a size-mismatch, we return -E2BIG and write the kernel's
native attribe size back into the provided structure.

Furthermore, add some attribute verification, so that we'll fail counter
creation when unknown bits are present (PERF_SAMPLE, PERF_FORMAT, or in
the __reserved fields).

(This ABI detail is introduced while keeping the existing syscall ABI.)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-12 14:28:52 +02:00
Peter Zijlstra
081fad8617 perf_counter: Remove PERF_TYPE_RAW special casing
The PERF_TYPE_RAW special case seems superfluous these days. Remove
it and add it to the switch() stmt like the others.

[ Impact: cleanup ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-12 14:28:51 +02:00
Rusty Russell
ad6561dffa module: trim exception table on init free.
It's theoretically possible that there are exception table entries
which point into the (freed) init text of modules.  These could cause
future problems if other modules get loaded into that memory and cause
an exception as we'd see the wrong fixup.  The only case I know of is
kvm-intel.ko (when CONFIG_CC_OPTIMIZE_FOR_SIZE=n).

Amerigo fixed this long-standing FIXME in the x86 version, but this
patch is more general.

This implements trim_init_extable(); most archs are simple since they
use the standard lib/extable.c sort code.  Alpha and IA64 use relative
addresses in their fixups, so thier trimming is a slight variation.

Sparc32 is unique; it doesn't seem to define ARCH_HAS_SORT_EXTABLE,
yet it defines its own sort_extable() which overrides the one in lib.
It doesn't sort, so we have to mark deleted entries instead of
actually trimming them.

Inspired-by: Amerigo Wang <amwang@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: linux-alpha@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
2009-06-12 21:47:04 +09:30
Rusty Russell
fddd520122 module_param: allow 'bool' module_params to be bool, not just int.
Impact: API cleanup

For historical reasons, 'bool' parameters must be an int, not a bool.
But there are around 600 users, so a conversion seems like useless churn.

So we use __same_type() to distinguish, and handle both cases.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 21:46:58 +09:30
Rusty Russell
45fcc70c0b module_param: split perm field into flags and perm
Impact: cleanup

Rather than hack KPARAM_KMALLOCED into the perm field, separate it out.
Since the perm field was 32 bits and only needs 16, we don't add bloat.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 21:46:56 +09:30
Rusty Russell
9a71af2c36 module_param: invbool should take a 'bool', not an 'int'
It takes an 'int' for historical reasons, and there are only two
users: simply switch it over to bool.

The other user (uvesafb.c) will get a (harmless-on-x86) warning until
the next patch is applied.

Cc: Brad Douglas <brad@neruo.com>
Cc: Michal Januszewski <spock@gentoo.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-12 21:46:56 +09:30
Yinghai Lu
28be225b23 irq: slab alloc for default irq_affinity
Ingo had

[    0.000000] ------------[ cut here ]------------
[    0.000000] WARNING: at mm/bootmem.c:537 alloc_arch_preferred_bootmem+0x2b/0x71()
[    0.000000] Hardware name: System Product Name
[    0.000000] Modules linked in:
[    0.000000] Pid: 0, comm: swapper Tainted: G        W  2.6.30-tip-03087-g0bb2618-dirty #52506
[    0.000000] Call Trace:
[    0.000000]  [<81032588>] warn_slowpath_common+0x60/0x90
[    0.000000]  [<810325c5>] warn_slowpath_null+0xd/0x10
[    0.000000]  [<819d1bc0>] alloc_arch_preferred_bootmem+0x2b/0x71
[    0.000000]  [<819d1c31>] ___alloc_bootmem_nopanic+0x2b/0x9a
[    0.000000]  [<81050a0a>] ? lock_release+0xac/0xb2
[    0.000000]  [<819d1d4c>] ___alloc_bootmem+0xe/0x2d
[    0.000000]  [<819d1e9f>] __alloc_bootmem+0xa/0xc
[    0.000000]  [<819d7c63>] alloc_bootmem_cpumask_var+0x21/0x26
[    0.000000]  [<819d0cc8>] early_irq_init+0x15/0x10d
[    0.000000]  [<819bb75a>] start_kernel+0x167/0x326
[    0.000000]  [<819bb06b>] __init_begin+0x6b/0x70
[    0.000000] ---[ end trace 4eaa2a86a8e2da23 ]---
[    0.000000] NR_IRQS:2304 nr_irqs:424
[    0.000000] CPU 0 irqstacks, hard=821e6000 soft=821e7000

we need to update init_irq_default_affinity

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-12 13:50:23 +03:00
Alessio Igor Bogani
337eb00a2c Push BKL down into ->remount_fs()
[xfs, btrfs, capifs, shmem don't need BKL, exempt]

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:11 -04:00
Al Viro
589ff870ed Switch collect_mounts() to struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:01 -04:00
Ingo Molnar
0d5959723e Merge branch 'linus' into x86/mce3
Conflicts:
	arch/x86/kernel/cpu/mcheck/mce_64.c
	arch/x86/kernel/irq.c

Merge reason: Resolve the conflicts above.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-11 23:31:52 +02:00
Linus Torvalds
512626a04e Merge branch 'for-linus' of git://linux-arm.org/linux-2.6
* 'for-linus' of git://linux-arm.org/linux-2.6:
  kmemleak: Add the corresponding MAINTAINERS entry
  kmemleak: Simple testing module for kmemleak
  kmemleak: Enable the building of the memory leak detector
  kmemleak: Remove some of the kmemleak false positives
  kmemleak: Add modules support
  kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash
  kmemleak: Add the vmalloc memory allocation/freeing hooks
  kmemleak: Add the slub memory allocation/freeing hooks
  kmemleak: Add the slob memory allocation/freeing hooks
  kmemleak: Add the slab memory allocation/freeing hooks
  kmemleak: Add documentation on the memory leak detector
  kmemleak: Add the base support

Manual conflict resolution (with the slab/earlyboot changes) in:
	drivers/char/vt.c
	init/main.c
	mm/slab.c
2009-06-11 14:15:57 -07:00
Linus Torvalds
8a1ca8cedd Merge branch 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
  perf_counter: Turn off by default
  perf_counter: Add counter->id to the throttle event
  perf_counter: Better align code
  perf_counter: Rename L2 to LL cache
  perf_counter: Standardize event names
  perf_counter: Rename enums
  perf_counter tools: Clean up u64 usage
  perf_counter: Rename perf_counter_limit sysctl
  perf_counter: More paranoia settings
  perf_counter: powerpc: Implement generalized cache events for POWER processors
  perf_counters: powerpc: Add support for POWER7 processors
  perf_counter: Accurate period data
  perf_counter: Introduce struct for sample data
  perf_counter tools: Normalize data using per sample period data
  perf_counter: Annotate exit ctx recursion
  perf_counter tools: Propagate signals properly
  perf_counter tools: Small frequency related fixes
  perf_counter: More aggressive frequency adjustment
  perf_counter/x86: Fix the model number of Intel Core2 processors
  perf_counter, x86: Correct some event and umask values for Intel processors
  ...
2009-06-11 14:01:07 -07:00
Linus Torvalds
b640f042fa Merge branch 'topic/slab/earlyboot' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'topic/slab/earlyboot' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  vgacon: use slab allocator instead of the bootmem allocator
  irq: use kcalloc() instead of the bootmem allocator
  sched: use slab in cpupri_init()
  sched: use alloc_cpumask_var() instead of alloc_bootmem_cpumask_var()
  memcg: don't use bootmem allocator in setup code
  irq/cpumask: make memoryless node zero happy
  x86: remove some alloc_bootmem_cpumask_var calling
  vt: use kzalloc() instead of the bootmem allocator
  sched: use kzalloc() instead of the bootmem allocator
  init: introduce mm_init()
  vmalloc: use kzalloc() instead of alloc_bootmem()
  slab: setup allocators earlier in the boot sequence
  bootmem: fix slab fallback on numa
  bootmem: use slab if bootmem is no longer available
2009-06-11 12:25:06 -07:00
Oleg Nesterov
b415c49a86 slow_work_thread() should do the exclusive wait
slow_work_thread() sleeps on slow_work_thread_wq without WQ_FLAG_EXCLUSIVE,
this means that slow_work_enqueue()->__wake_up(nr_exclusive => 1) wakes up all
kslowd threads.  This is not what we want, so we change slow_work_thread() to
use prepare_to_wait_exclusive() instead.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-11 11:26:38 -07:00
Linus Torvalds
c9059598ea Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
  block: add request clone interface (v2)
  floppy: fix hibernation
  ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
  fs/bio.c: add missing __user annotation
  block: prevent possible io_context->refcount overflow
  Add serial number support for virtio_blk, V4a
  block: Add missing bounce_pfn stacking and fix comments
  Revert "block: Fix bounce limit setting in DM"
  cciss: decode unit attention in SCSI error handling code
  cciss: Remove no longer needed sendcmd reject processing code
  cciss: change SCSI error handling routines to work with interrupts enabled.
  cciss: separate error processing and command retrying code in sendcmd_withirq_core()
  cciss: factor out fix target status processing code from sendcmd functions
  cciss: simplify interface of sendcmd() and sendcmd_withirq()
  cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
  cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
  block: needs to set the residual length of a bidi request
  Revert "block: implement blkdev_readpages"
  block: Fix bounce limit setting in DM
  Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
  ...

Manually fix conflicts with tracing updates in:
	block/blk-sysfs.c
	drivers/ide/ide-atapi.c
	drivers/ide/ide-cd.c
	drivers/ide/ide-floppy.c
	drivers/ide/ide-tape.c
	include/trace/events/block.h
	kernel/trace/blktrace.c
2009-06-11 11:10:35 -07:00
Linus Torvalds
d3d07d941f Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (266 commits)
  sh: Tie sparseirq in to Kconfig.
  sh: Wire up sys_rt_tgsigqueueinfo.
  sh: Fix sys_pwritev() syscall table entry for sh32.
  sh: Fix sh4a llsc-based cmpxchg()
  sh: sh7724: Add JPU support
  sh: sh7724: INTC setting update
  sh: sh7722 clock framework rewrite
  sh: sh7366 clock framework rewrite
  sh: sh7343 clock framework rewrite
  sh: sh7724 clock framework rewrite V3
  sh: sh7723 clock framework rewrite V2
  sh: add enable()/disable()/set_rate() to div6 code
  sh: add AP325RXA mode pin configuration
  sh: add Migo-R mode pin configuration
  sh: sh7722 mode pin definitions
  sh: sh7724 mode pin comments
  sh: sh7723 mode pin V2
  sh: rework mode pin code
  sh: clock div6 helper code
  sh: clock div4 frequency table offset fix
  ...
2009-06-11 10:08:33 -07:00
Linus Torvalds
3296ca27f5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (44 commits)
  nommu: Provide mmap_min_addr definition.
  TOMOYO: Add description of lists and structures.
  TOMOYO: Remove unused field.
  integrity: ima audit dentry_open failure
  TOMOYO: Remove unused parameter.
  security: use mmap_min_addr indepedently of security models
  TOMOYO: Simplify policy reader.
  TOMOYO: Remove redundant markers.
  SELinux: define audit permissions for audit tree netlink messages
  TOMOYO: Remove unused mutex.
  tomoyo: avoid get+put of task_struct
  smack: Remove redundant initialization.
  integrity: nfsd imbalance bug fix
  rootplug: Remove redundant initialization.
  smack: do not beyond ARRAY_SIZE of data
  integrity: move ima_counts_get
  integrity: path_check update
  IMA: Add __init notation to ima functions
  IMA: Minimal IMA policy and boot param for TCB IMA policy
  selinux: remove obsolete read buffer limit from sel_read_bool
  ...
2009-06-11 10:01:41 -07:00
Pekka Enberg
22fb4e71e6 irq: use kcalloc() instead of the bootmem allocator
Fixes the following problem:

[    0.000000] Experimental hierarchical RCU init done.
[    0.000000] NR_IRQS:4352 nr_irqs:256
[    0.000000] ------------[ cut here ]------------
[    0.000000] WARNING: at mm/bootmem.c:537 alloc_arch_preferred_bootmem+0x40/0x7e()
[    0.000000] Hardware name: To Be Filled By O.E.M.
[    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #59709
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff823f8c8e>] ? alloc_arch_preferred_bootmem+0x40/0x7e
[    0.000000]  [<ffffffff81067168>] warn_slowpath_common+0x88/0xcb
[    0.000000]  [<ffffffff810671d2>] warn_slowpath_null+0x27/0x3d
[    0.000000]  [<ffffffff823f8c8e>] alloc_arch_preferred_bootmem+0x40/0x7e
[    0.000000]  [<ffffffff823f9307>] ___alloc_bootmem_nopanic+0x4e/0xec
[    0.000000]  [<ffffffff823f93c5>] ___alloc_bootmem+0x20/0x61
[    0.000000]  [<ffffffff823f962e>] __alloc_bootmem+0x1e/0x34
[    0.000000]  [<ffffffff823f757c>] early_irq_init+0x6d/0x118
[    0.000000]  [<ffffffff823e0140>] ? early_idt_handler+0x0/0x71
[    0.000000]  [<ffffffff823e0cf7>] start_kernel+0x192/0x394
[    0.000000]  [<ffffffff823e0140>] ? early_idt_handler+0x0/0x71
[    0.000000]  [<ffffffff823e02ad>] x86_64_start_reservations+0xb4/0xcf
[    0.000000]  [<ffffffff823e0000>] ? __init_begin+0x0/0x140
[    0.000000]  [<ffffffff823e0420>] x86_64_start_kernel+0x158/0x17b
[    0.000000] ---[ end trace a7919e7f17c0a725 ]---
[    0.000000] Fast TSC calibration using PIT
[    0.000000] Detected 2002.510 MHz processor.
[    0.004000] Console: colour VGA+ 80x25

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 19:27:13 +03:00
Pekka Enberg
0fb5302916 sched: use slab in cpupri_init()
Lets not use the bootmem allocator in cpupri_init() as slab is already up when
it is run.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 19:27:12 +03:00
Pekka Enberg
4bdddf8ff9 sched: use alloc_cpumask_var() instead of alloc_bootmem_cpumask_var()
Slab is initialized when sched_init() runs now so lets use alloc_cpumask_var().

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 19:27:11 +03:00
Yinghai Lu
dad213aeb5 irq/cpumask: make memoryless node zero happy
Don't hardcode to node zero for early boot IRQ setup memory allocations.

[ penberg@cs.helsinki.fi: minor cleanups ]
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 19:27:08 +03:00
Yinghai Lu
38c7fed2f5 x86: remove some alloc_bootmem_cpumask_var calling
Now that we set up the slab allocator earlier, we can get rid of some
alloc_bootmem_cpumask_var() calls in boot code.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 19:27:07 +03:00
Pekka Enberg
36b7b6d465 sched: use kzalloc() instead of the bootmem allocator
Now that kmem_cache_init() happens before sched_init(), we should use kzalloc()
and not the bootmem allocator.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11 19:27:04 +03:00
Catalin Marinas
4f2294b6dc kmemleak: Add modules support
This patch handles the kmemleak operations needed for modules loading so
that memory allocations from inside a module are properly tracked.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-11 17:03:31 +01:00
Ingo Molnar
940010c5a3 Merge branch 'linus' into perfcounters/core
Conflicts:
	arch/x86/kernel/irqinit.c
	arch/x86/kernel/irqinit_64.c
	arch/x86/kernel/traps.c
	arch/x86/mm/fault.c
	include/linux/sched.h
	kernel/exit.c
2009-06-11 17:55:42 +02:00
Peter Zijlstra
cca3f454a8 perf_counter: Add counter->id to the throttle event
So as to be able to distuinguish between multiple counters.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-11 17:54:45 +02:00
Peter Zijlstra
f4dbfa8f31 perf_counter: Standardize event names
Pure renames only, to PERF_COUNT_HW_* and PERF_COUNT_SW_*.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-11 17:54:15 +02:00
Peter Zijlstra
1c432d899d perf_counter: Rename enums
Rename the perf enums to be in the 'perf_' namespace and strictly
enumerate the ABI bits.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-11 17:53:41 +02:00
Peter Zijlstra
df58ab24bf perf_counter: Rename perf_counter_limit sysctl
Rename perf_counter_limit to perf_counter_max_sample_rate and
prohibit creation of counters with a known higher sample
frequency.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-11 16:48:38 +02:00
Peter Zijlstra
0764771dab perf_counter: More paranoia settings
Rename the perf_counter_priv knob to perf_counter_paranoia (because
priv can be read as private, as opposed to privileged) and provide
one more level:

 0 - permissive
 1 - restrict cpu counters to privilidged contexts
 2 - restrict kernel-mode code counting and profiling

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-11 16:48:38 +02:00
john stultz
3f68535ada clocksource: sanity check sysfs clocksource changes
Thomas, Andrew and Ingo pointed out that we don't have any safety checks
in the clocksource sysfs entries to make sure sysadmins don't try to
change the clocksource to a non high-res timer capable clocksource (such
as jiffies) when high-res timers (HRT) is enabled.  Doing so will likely
hang a system.

Correct this by filtering non HRT clocksources from available_clocksources
and not accepting non HRT clocksources with HRT enabled.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-06-11 11:24:52 +02:00
Paul Mundt
cf9fe114e3 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-06-11 09:01:14 +03:00
Linus Torvalds
991ec02cdc Merge branch 'tracing-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  function-graph: always initialize task ret_stack
  function-graph: move initialization of new tasks up in fork
  function-graph: add memory barriers for accessing task's ret_stack
  function-graph: enable the stack after initialization of other variables
  function-graph: only allocate init tasks if it was not already done

Manually fix trivial conflict in kernel/trace/ftrace.c
2009-06-10 19:58:10 -07:00
Linus Torvalds
8623661180 Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
  Revert "x86, bts: reenable ptrace branch trace support"
  tracing: do not translate event helper macros in print format
  ftrace/documentation: fix typo in function grapher name
  tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
  tracing: add protection around module events unload
  tracing: add trace_seq_vprint interface
  tracing: fix the block trace points print size
  tracing/events: convert block trace points to TRACE_EVENT()
  ring-buffer: fix ret in rb_add_time_stamp
  ring-buffer: pass in lockdep class key for reader_lock
  tracing: add annotation to what type of stack trace is recorded
  tracing: fix multiple use of __print_flags and __print_symbolic
  tracing/events: fix output format of user stack
  tracing/events: fix output format of kernel stack
  tracing/trace_stack: fix the number of entries in the header
  ring-buffer: discard timestamps that are at the start of the buffer
  ring-buffer: try to discard unneeded timestamps
  ring-buffer: fix bug in ring_buffer_discard_commit
  ftrace: do not profile functions when disabled
  tracing: make trace pipe recognize latency format flag
  ...
2009-06-10 19:53:40 -07:00
Linus Torvalds
8f40642ad3 Merge branch 'signal-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'signal-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: hookup sys_rt_tgsigqueueinfo
  signals: implement sys_rt_tgsigqueueinfo
  signals: split do_tkill
2009-06-10 19:50:52 -07:00
Linus Torvalds
20f3f3ca49 Merge branch 'rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  rcu: rcu_sched_grace_period(): kill the bogus flush_signals()
  rculist: use list_entry_rcu in places where it's appropriate
  rculist.h: introduce list_entry_rcu() and list_first_entry_rcu()
  rcu: Update RCU tracing documentation for __rcu_pending
  rcu: Add __rcu_pending tracing to hierarchical RCU
  RCU: make treercu be default
2009-06-10 19:50:03 -07:00
James Morris
73fbad283c Merge branch 'next' into for-linus 2009-06-11 11:03:14 +10:00
Peter Zijlstra
9e350de37a perf_counter: Accurate period data
We currently log hw.sample_period for PERF_SAMPLE_PERIOD, however this is
incorrect. When we adjust the period, it will only take effect the next
cycle but report it for the current cycle. So when we adjust the period
for every cycle, we're always wrong.

Solve this by keeping track of the last_period.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-11 02:39:02 +02:00
Peter Zijlstra
df1a132bf3 perf_counter: Introduce struct for sample data
For easy extension of the sample data, put it in a structure.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-11 02:39:02 +02:00
Peter Zijlstra
66fff22483 perf_counter: Annotate exit ctx recursion
Ever since Paul fixed it to unclone the context before taking the
ctx->lock this became a false positive, annotate it away.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-11 02:39:01 +02:00
Linus Torvalds
e7241d7714 Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  spinlock: Add missing __raw_spin_lock_flags() stub for UP
  mutex: add atomic_dec_and_mutex_lock(), fix
  locking, rtmutex.c: Documentation cleanup
  mutex: add atomic_dec_and_mutex_lock()
2009-06-10 16:19:40 -07:00
Linus Torvalds
75063600fd Merge branch 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futex: fix restart in wait_requeue_pi
  futex: fix restart for early wakeup in futex_wait_requeue_pi()
  futex: cleanup error exit
  futex: remove the wait queue
  futex: add requeue-pi documentation
  futex: remove FUTEX_REQUEUE_PI (non CMP)
  futex: fix futex_wait_setup key handling
  sparc64: extend TI_RESTART_BLOCK space by 8 bytes
  futex: fixup unlocked requeue pi case
  futex: add requeue_pi functionality
  futex: split out futex value validation code
  futex: distangle futex_requeue()
  futex: add FUTEX_HAS_TIMEOUT flag to restart.futex.flags
  rt_mutex: add proxy lock routines
  futex: split out fixup owner logic from futex_lock_pi()
  futex: split out atomic logic from futex_lock_pi()
  futex: add helper to find the top prio waiter of a futex
  futex: separate futex_wait_queue_me() logic from futex_wait()
2009-06-10 16:16:48 -07:00
Linus Torvalds
be15f9d63b Merge branch 'x86-xen-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-xen-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (42 commits)
  xen: cache cr0 value to avoid trap'n'emulate for read_cr0
  xen/x86-64: clean up warnings about IST-using traps
  xen/x86-64: fix breakpoints and hardware watchpoints
  xen: reserve Xen start_info rather than e820 reserving
  xen: add FIX_TEXT_POKE to fixmap
  lguest: update lazy mmu changes to match lguest's use of kvm hypercalls
  xen: honour VCPU availability on boot
  xen: add "capabilities" file
  xen: drop kexec bits from /sys/hypervisor since kexec isn't implemented yet
  xen/sys/hypervisor: change writable_pt to features
  xen: add /sys/hypervisor support
  xen/xenbus: export xenbus_dev_changed
  xen: use device model for suspending xenbus devices
  xen: remove suspend_cancel hook
  xen/dev-evtchn: clean up locking in evtchn
  xen: export ioctl headers to userspace
  xen: add /dev/xen/evtchn driver
  xen: add irq_from_evtchn
  xen: clean up gate trap/interrupt constants
  xen: set _PAGE_NX in __supported_pte_mask before pagetable construction
  ...
2009-06-10 16:16:27 -07:00
Linus Torvalds
082b63ae45 Merge branch 'sched-docs-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-docs-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Document memory barriers implied by sleep/wake-up primitives
2009-06-10 15:48:53 -07:00
Linus Torvalds
99e97b860e Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: fix typo in sched-rt-group.txt file
  ftrace: fix typo about map of kernel priority in ftrace.txt file.
  sched: properly define the sched_group::cpumask and sched_domain::span fields
  sched, timers: cleanup avenrun users
  sched, timers: move calc_load() to scheduler
  sched: Don't export sched_mc_power_savings on multi-socket single core system
  sched: emit thread info flags with stack trace
  sched: rt: document the risk of small values in the bandwidth settings
  sched: Replace first_cpu() with cpumask_first() in ILB nomination code
  sched: remove extra call overhead for schedule()
  sched: use group_first_cpu() instead of cpumask_first(sched_group_cpus())
  wait: don't use __wake_up_common()
  sched: Nominate a power-efficient ilb in select_nohz_balancer()
  sched: Nominate idle load balancer from a semi-idle package.
  sched: remove redundant hierarchy walk in check_preempt_wakeup
2009-06-10 15:32:59 -07:00
Linus Torvalds
82782ca77d Merge branch 'x86-kbuild-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-kbuild-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (46 commits)
  x86, boot: add new generated files to the appropriate .gitignore files
  x86, boot: correct the calculation of ZO_INIT_SIZE
  x86-64: align __PHYSICAL_START, remove __KERNEL_ALIGN
  x86, boot: correct sanity checks in boot/compressed/misc.c
  x86: add extension fields for bootloader type and version
  x86, defconfig: update kernel position parameters
  x86, defconfig: update to current, no material changes
  x86: make CONFIG_RELOCATABLE the default
  x86: default CONFIG_PHYSICAL_START and CONFIG_PHYSICAL_ALIGN to 16 MB
  x86: document new bzImage fields
  x86, boot: make kernel_alignment adjustable; new bzImage fields
  x86, boot: remove dead code from boot/compressed/head_*.S
  x86, boot: use LOAD_PHYSICAL_ADDR on 64 bits
  x86, boot: make symbols from the main vmlinux available
  x86, boot: determine compressed code offset at compile time
  x86, boot: use appropriate rep string for move and clear
  x86, boot: zero EFLAGS on 32 bits
  x86, boot: set up the decompression stack as early as possible
  x86, boot: straighten out ranges to copy/zero in compressed/head*.S
  x86, boot: stylistic cleanups for boot/compressed/head_64.S
  ...

Fixed trivial conflict in arch/x86/configs/x86_64_defconfig manually
2009-06-10 15:30:41 -07:00
Linus Torvalds
f0d5e12bd4 Merge branch 'irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (76 commits)
  x86, apic: Fix dummy apic read operation together with broken MP handling
  x86, apic: Restore irqs on fail paths
  x86: Print real IOAPIC version for x86-64
  x86: enable_update_mptable should be a macro
  sparseirq: Allow early irq_desc allocation
  x86, io-apic: Don't mark pin_programmed early
  x86, irq: don't call mp_config_acpi_gsi() if update_mptable is not enabled
  x86, irq: update_mptable needs pci_routeirq
  x86: don't call read_apic_id if !cpu_has_apic
  x86, apic: introduce io_apic_irq_attr
  x86/pci: add 4 more return parameters to IO_APIC_get_PCI_irq_vector(), fix
  x86: read apic ID in the !acpi_lapic case
  x86: apic: Fixmap apic address even if apic disabled
  x86: display extended apic registers with print_local_APIC and cpu_debug code
  x86: read apic ID in the !acpi_lapic case
  x86: clean up and fix setup_clear/force_cpu_cap handling
  x86: apic: Check rev 3 fadt correctly for physical_apic bit
  x86/pci: update pirq_enable_irq() to setup io apic routing
  x86/acpi: move setup io apic routing out of CONFIG_ACPI scope
  x86/pci: add 4 more return parameters to IO_APIC_get_PCI_irq_vector()
  ...
2009-06-10 15:25:41 -07:00
Peter Zijlstra
bd2b5b1284 perf_counter: More aggressive frequency adjustment
Also employ the overflow handler to adjust the frequency, this results
in a stable frequency in about 40~50 samples, instead of that many ticks.

This also means we can start sampling at a sample period of 1 without
running head-first into the throttle.

It relies on sched_clock() to accurately measure the time difference
between the overflow NMIs.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-10 16:55:26 +02:00
Steven Rostedt
110bf2b764 tracing: add protection around module events unload
When reading the trace buffer, there is a race that when a module
is unloaded it removes events that is stilled referenced in the buffers.
This patch adds the protection around the unloading of the events
from modules and the reading of the trace buffers.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 17:29:07 -04:00
Manish Katiyar
ad6ccfad6f kernel/kallsyms.c: replace deprecated __initcall with device_initcall and fix whitespace
Fix coding style whitespace issues and replace __initcall with
device_initcall.  Fixed multi-line comments as per coding style.

Errors as reported by checkpatch.pl :-
Before:
total: 14 errors, 14 warnings, 487 lines checked
After :
total: 0 errors, 8 warnings, 507 lines checked

Compile tested binary verified as :-
Before:
 text    data     bss     dec     hex filename
 2405       4       0    2409     969 kernel/kallsyms.o
After :
 text     data     bss     dec     hex filename
 2405       4       0    2409     969 kernel/kallsyms.o

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2009-06-09 22:37:52 +02:00
Steven Rostedt
725c624a58 tracing: add trace_seq_vprint interface
The code to update the print formats for events requires a vprintf
format in the trace_seq. This patch adds that interface.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 15:17:32 -04:00
Li Zefan
55782138e4 tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:

  - zero-copy and per-cpu splice() tracing
  - binary tracing without printf overhead
  - structured logging records exposed under /debug/tracing/events
  - trace events embedded in function tracer output and other plugins
  - user-defined, per tracepoint filter expressions
  ...

Cons:

  - no dev_t info for the output of plug, unplug_timer and unplug_io events.
    no dev_t info for getrq and sleeprq events if bio == NULL.
    no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the deivce from a request queue.
    But this may change in the future.

  - A packet command is converted to a string in TP_assign, not TP_print.
    While blktrace do the convertion just before output.

    Since pc requests should be rather rare, this is not a big issue.

  - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format, which means we have some unused data in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().

I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

      dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s

So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.

And the binary output of TRACE_EVENT is much smaller than blktrace:

 # ls -l -h
 -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
 -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
 -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

Following are some comparisons between TRACE_EVENT and blktrace:

plug:
  kjournald-480   [000]   303.084981: block_plug: [kjournald]
  kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]

unplug_io:
  kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
  kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1

remap:
  kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
  kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384

bio_backmerge:
  kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
  kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]

getrq:
  kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]

  bash-2066  [001]  1072.953770:   8,0    G   N [bash]
  bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]

rq_complete:
  konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
  konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]

  ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
  ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]

rq_insert:
  kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]

Changelog from v2 -> v3:

- use the newly introduced __dynamic_array().

Changelog from v1 -> v2:

- use __string() instead of __array() to minimize the memory required
  to store hex dump of rq->cmd().

- support large pc requests.

- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

- some cleanups.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 12:34:23 -04:00
Steven Rostedt
f57a8a1911 ring-buffer: fix ret in rb_add_time_stamp
The update of ret got mistakenly added to the if statement of
rb_try_to_discard. The variable ret should be 1 on commit and zero
otherwise.

[ Impact: fix compiler warning and real bug ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 12:33:30 -04:00
Yinghai Lu
eaa958402e cpumask: alloc zeroed cpumask for static cpumask_var_ts
These are defined as static cpumask_var_t so if MAXSMP is not used,
they are cleared already.  Avoid surprises when MAXSMP is enabled.

Signed-off-by: Yinghai Lu <yinghai.lu@kernel.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-06-09 22:30:27 +09:30
James Morris
0b4ec6e4e0 Merge branch 'master' into next 2009-06-09 09:27:53 +10:00
Peter Zijlstra
1f8a6a10fb ring-buffer: pass in lockdep class key for reader_lock
On Sun, 7 Jun 2009, Ingo Molnar wrote:
> Testing tracer sched_switch: <6>Starting ring buffer hammer
> PASSED
> Testing tracer sysprof: PASSED
> Testing tracer function: PASSED
> Testing tracer irqsoff:
> =============================================
> PASSED
> Testing tracer preemptoff: PASSED
> Testing tracer preemptirqsoff: [ INFO: possible recursive locking detected ]
> PASSED
> Testing tracer branch: 2.6.30-rc8-tip-01972-ge5b9078-dirty #5760
> ---------------------------------------------
> rb_consumer/431 is trying to acquire lock:
>  (&cpu_buffer->reader_lock){......}, at: [<c109eef7>] ring_buffer_reset_cpu+0x37/0x70
>
> but task is already holding lock:
>  (&cpu_buffer->reader_lock){......}, at: [<c10a019e>] ring_buffer_consume+0x7e/0xc0
>
> other info that might help us debug this:
> 1 lock held by rb_consumer/431:
>  #0:  (&cpu_buffer->reader_lock){......}, at: [<c10a019e>] ring_buffer_consume+0x7e/0xc0

The ring buffer is a generic structure, and can be used outside of
ftrace. If ftrace traces within the use of the ring buffer, it can produce
false positives with lockdep.

This patch passes in a static lock key into the allocation of the ring
buffer, so that different ring buffers will have their own lock class.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1244477919.13761.9042.camel@twins>

[ store key in ring buffer descriptor ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-08 18:50:20 -04:00
Linus Torvalds
3af968e066 async: Fix lack of boot-time console due to insufficient synchronization
Our async work synchronization was broken by "async: make sure
independent async domains can't accidentally entangle" (commit
d5a877e8dd), because it would report
the wrong lowest active async ID when there was both running and
pending async work.

This caused things like no being able to read the root filesystem,
resulting in missing console devices and inability to run 'init',
causing a boot-time panic.

This fixes it by properly returning the lowest pending async ID: if
there is any running async work, that will have a lower ID than any
pending work, and we should _not_ look at the pending work list.

There were alternative patches from Jaswinder and James, but this one
also cleans up the code by removing the pointless 'ret' variable and
the unnecesary testing for an empty list around 'for_each_entry()' (if
the list is empty, the for_each_entry() thing just won't execute).

Fixes-bug: http://bugzilla.kernel.org/show_bug.cgi?id=13474
Reported-and-tested-by: Chris Clayton <chris2553@googlemail.com>
Cc: Jaswinder Singh Rajput <jaswinder@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-08 12:31:53 -07:00
Artem Bityutskiy
8daa21e61b hrtimer: export ktime_add_safe
We want to use hrtimers in UBIFS (for write-buffer write-back timer).
We need the 'hrtimer_set_expires_range_ns()', which is an in-line
function which uses 'ktime_add_safe()'.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-06-08 11:14:58 +03:00
Ingo Molnar
8326f44da0 perf_counter: Implement generalized cache event types
Extend generic event enumeration with the PERF_TYPE_HW_CACHE
method.

This is a 3-dimensional space:

       { L1-D, L1-I, L2, ITLB, DTLB, BPU } x
       { load, store, prefetch } x
       { accesses, misses }

User-space passes in the 3 coordinates and the kernel provides
a counter. (if the hardware supports that type and if the
combination makes sense.)

Combinations that make no sense produce a -EINVAL.
Combinations that are not supported by the hardware produce -ENOTSUP.

Extend the tools to deal with this, and rewrite the event symbol
parsing code with various popular aliases for the units and
access methods above. So 'l1-cache-miss' and 'l1d-read-ops' are
both valid aliases.

( x86 is supported for now, with the Nehalem event table filled in,
  and with Core2 and Atom having placeholder tables. )

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-06 13:14:47 +02:00
Ingo Molnar
a21ca2cac5 perf_counter: Separate out attr->type from attr->config
Counter type is a frequently used value and we do a lot of
bit juggling by encoding and decoding it from attr->config.

Clean this up by creating a separate attr->type field.

Also clean up the various similarly complex user-space bits
all around counter attribute management.

The net improvement is significant, and it will be easier
to add a new major type (which is what triggered this cleanup).

(This changes the ABI, all tools are adapted.)
(PowerPC build-tested.)

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-06 11:37:22 +02:00
Peter Zijlstra
6a24ed6c60 perf_counter: Fix frequency adjustment for < HZ
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-05 18:07:48 +02:00
Peter Zijlstra
689802b2d0 perf_counter: Add PERF_SAMPLE_PERIOD
In order to allow easy tracking of the period, also provide means of
adding it to the sample data.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-05 18:07:47 +02:00
Peter Zijlstra
ac4bcf8894 perf_counter: Change PERF_SAMPLE_CONFIG into PERF_SAMPLE_ID
The purpose of PERF_SAMPLE_CONFIG was to identify the counters,
since then we've added counter ids, use those instead.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-05 18:07:47 +02:00
Ingo Molnar
918143e8b7 Merge branch 'tip/tracing/ftrace-4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-06-05 16:50:29 +02:00
Peter Zijlstra
089dd79db9 perf_counter: Generate mmap events for install_special_mapping()
In order to track the vdso also generate mmap events for
install_special_mapping().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-05 14:46:41 +02:00
Paul Mackerras
6dc5f2a417 perf_counter: Fix lockup with interrupting counters
Commit 8e3747c1 ("perf_counter: Change data head from u32 to u64")
changed the type of 'head' in struct perf_mmap_data from atomic_t
to atomic_long_t, but missed converting one use of atomic_read on
it to atomic_long_read.  The effect of using atomic_read rather than
atomic_long_read on powerpc (and other big-endian architectures) is
that we get the high half of the 64-bit quantity, resulting in the
cmpxchg retry loop in perf_output_begin spinning forever as soon as
data->head becomes non-zero.  On little-endian architectures such as
x86 we would get the low half, resulting in a lockup once data->head
becomes greater than 4G.

This fixes it by using atomic_long_read rather than atomic_read.

[ Impact: fix perfcounter lockup on PowerPC / big-endian systems ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <18984.33964.21541.743096@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-05 08:22:26 +02:00
Oleg Nesterov
edaba2c533 ptrace: revert "ptrace_detach: the wrong wakeup breaks the ERESTARTxxx logic"
Commit 95a3540da9 ("ptrace_detach: the wrong
wakeup breaks the ERESTARTxxx logic") removed the "extra"
wake_up_process() from ptrace_detach(), but as Jan pointed out this breaks
the compatibility.

I believe the changelog is right and this wake_up() is wrong in many
ways, but GDB assumes that ptrace(PTRACE_DETACH, child, 0, 0) always
wakes up the tracee.

Despite the fact this breaks SIGNAL_STOP_STOPPED/group_stop_count logic,
and despite the fact this wake_up_process() can break another
assumption: PTRACE_DETACH with SIGSTOP should leave the tracee in
TASK_STOPPED case.  Because the untraced child can dequeue SIGSTOP and
call do_signal_stop() before ptrace_detach() calls wake_up_process().

Revert this change for now.  We need some fixes even if we we want to keep
the current behaviour, but these fixes are not for 2.6.30.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-04 18:07:40 -07:00
Oleg Nesterov
087eb43705 ptrace: tracehook_report_clone: fix false positives
The "trace || CLONE_PTRACE" check in tracehook_report_clone() is not right,

- If the untraced task does clone(CLONE_PTRACE) the new child is not traced,
  we must not queue SIGSTOP.

- If we forked the traced task, but the tracer exits and untraces both the
  forking task and the new child (after copy_process() drops tasklist_lock),
  we should not queue SIGSTOP too.

Change the code to check task_ptrace() != 0 instead. This is still racy, but
the race is harmless.

We can race with another tracer attaching to this child, or the tracer can
exit and detach in parallel. But giwen that we didn't do wake_up_new_task()
yet, the child must have the pending SIGSTOP anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-04 18:07:40 -07:00
Peter Zijlstra
d99e944620 perf_counter: Remove munmap stuff
In name of keeping it simple, only track mmap events. Userspace
will have to remove old overlapping maps when it encounters them.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-04 17:51:38 +02:00
Peter Zijlstra
60313ebed7 perf_counter: Add fork event
Create a fork event so that we can easily clone the comm and
dso maps without having to generate all those events.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-04 17:51:38 +02:00
Ingo Molnar
64edbc5620 Merge branch 'tracing/ftrace' into tracing/core
Merge reason: this mini-topic had outstanding problems that delayed
              its merge, so it does not fast-forward.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-04 13:59:40 +02:00
Christoph Lameter
e0a94c2a63 security: use mmap_min_addr indepedently of security models
This patch removes the dependency of mmap_min_addr on CONFIG_SECURITY.
It also sets a default mmap_min_addr of 4096.

mmapping of addresses below 4096 will only be possible for processes
with CAP_SYS_RAWIO.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Eric Paris <eparis@redhat.com>
Looks-ok-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: James Morris <jmorris@namei.org>
2009-06-04 12:07:48 +10:00
Ingo Molnar
128f048f0f perf_counter: Fix throttling lock-up
Throttling logic is broken and we can lock up with too small
hw sampling intervals.

Make the throttling code more robust: disable counters even
if we already disabled them.

( Also clean up whitespace damage i noticed while reading
  various pieces of code related to throttling. )

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-03 23:39:51 +02:00
Steven Rostedt
563af16c30 tracing: add annotation to what type of stack trace is recorded
The current method of printing out a stack trace is to add a new line
and print out the trace:

    yum-updatesd-3120  [002]   573.691303:
 => do_softirq
 => irq_exit
 => smp_apic_timer_interrupt
 => apic_timer_interrupt

This looks a bit awkward, and if we have both stack and user stack traces
running, it would be nice to have a title to tell them apart, although
it is easy to tell by the output.

This patch adds an annotation to the start of the stack traces:

            init-1     [003]   929.304979: <stack trace>
 => user_path_at
 => vfs_fstatat
 => vfs_stat
 => sys_newstat
 => system_call_fastpath

             cat-3459  [002]  1016.824040: <user stack trace>
 =>  <0000003aae6c0250>
 =>  <00007ffff4b06ae4>
 =>  <69636172742f6775>

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-03 11:10:44 -04:00
Steven Whitehouse
56d8bd3f0b tracing: fix multiple use of __print_flags and __print_symbolic
Here is an updated patch to include the extra call to
trace_seq_init() as requested. This is vs. the latest
-tip tree and fixes the use of multiple __print_flags
and __print_symbolic in a single tracer. Also tested
to ensure its working now:

mount.gfs2-2534  [000]   235.850587: gfs2_glock_queue: 8.7 glock 1:2 dequeue PR
mount.gfs2-2534  [000]   235.850591: gfs2_demote_rq: 8.7 glock 1:0 demote EX to NL flags:DI
mount.gfs2-2534  [000]   235.850591: gfs2_glock_queue: 8.7 glock 1:0 dequeue EX
glock_workqueue-2529  [000]   235.850666: gfs2_glock_state_change: 8.7 glock 1:0 state EX => NL tgt:NL dmt:NL flags:lDpI
glock_workqueue-2529  [000]   235.850672: gfs2_glock_put: 8.7 glock 1:0 state NL => IV flags:I

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
LKML-Reference: <1244037123.29604.603.camel@localhost.localdomain>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-03 10:29:48 -04:00
walimis
048dc50c5e tracing/events: fix output format of user stack
According to "events/ftrace/user_stack/format", fix the output of
user stack.

before fix:

  sh-1073  [000]    31.137561:  <b7f274fe> <-  <0804e33c> <-  <080835c1>

after fix:

  sh-1072  [000]    37.039329:
 =>  <b7f8a4fe>
 =>  <0804e33c>
 =>  <080835c1>

Signed-off-by: walimis <walimisdev@gmail.com>
LKML-Reference: <1244016090-7814-3-git-send-email-walimisdev@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-03 10:25:30 -04:00
walimis
f11b3f4e29 tracing/events: fix output format of kernel stack
According to "events/ftrace/kernel_stack/format", output format of
kernel stack should use "=>" instead of "<=".

The second problem is that we shouldn't skip the first entry in the stack,
although it seems to be duplicated when used in the "function" tracer,
but events also use it. If we skip the first one, we will drop the topmost
entry of the stack.

The last problem is that if the last entry is ULONG_MAX(0xffffffff), we should
drop it, otherwise it will print a NULL name line.

before fix:

      sh-1072  [000]   26.957239: sched_process_fork: parent sh:1072 child sh:1073
      sh-1072  [000]   26.957262:
 <= syscall_call
 <=
      sh-1072  [000]   26.957744: sched_switch: task sh:1072 [120] (R) ==> sh:1073 [120]
      sh-1072  [000]   26.957752:
 <= preempt_schedule
 <= wake_up_new_task
 <= do_fork
 <= sys_clone
 <= syscall_call
 <=

After fix:

      sh-1075  [000]    39.791848: sched_process_fork: parent sh:1075  child sh:1076
      sh-1075  [000]    39.791871:
 => sys_clone
 => syscall_call
      sh-1075  [000]    39.792713: sched_switch: task sh:1075 [120] (R) ==> sh:1076 [120]
      sh-1075  [000]    39.792722:
 => schedule
 => preempt_schedule
 => wake_up_new_task
 => do_fork
 => sys_clone
 => syscall_call

Signed-off-by: walimis <walimisdev@gmail.com>
LKML-Reference: <1244016090-7814-2-git-send-email-walimisdev@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-03 10:25:15 -04:00
walimis
083a63b48e tracing/trace_stack: fix the number of entries in the header
The last entry in the stack_dump_trace is ULONG_MAX, which is not
a valid entry, but max_stack_trace.nr_entries has accounted for it.
So when printing the header, we should decrease it by one.
Before fix, print as following, for example:

	Depth    Size   Location    (53 entries)	<--- should be 52
	-----    ----   --------
  0)     3264     108   update_wall_time+0x4d5/0x9a0
  ...
 51)       80      80   syscall_call+0x7/0xb
 ^^^
   it's correct.

Signed-off-by: walimis <walimisdev@gmail.com>
LKML-Reference: <1244016090-7814-1-git-send-email-walimisdev@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-03 10:24:44 -04:00
Steven Rostedt
ea05b57cc1 ring-buffer: discard timestamps that are at the start of the buffer
Every buffer page in the ring buffer includes its own time stamp.
When an event is recorded to the ring buffer with a delta time greater
than what can be held in the event header, a time stamp event is created.

If the the create timestamp falls over to the next buffer page, it is
redundant because the buffer page holds a full time stamp. This patch
will try to discard the time stamp when it falls to the start of the
next page.

This change also fixes a issues with disarding events. If most events are
discarded, timestamps will start to creep into the ring buffer. If we
do not discard the timestamps then they can fill up the ring buffer over
time and waste space.

This change will keep time stamps from filling up over another page. If
something is recorded in the buffer page, and the rest is filtered, then
the time stamps can only fill up to the end of the page.

[ Impact: prevent time stamps from filling ring buffer ]

Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-03 10:15:25 -04:00
Steven Rostedt
edd813bffc ring-buffer: try to discard unneeded timestamps
There are times that a race may happen that we add a timestamp in a
nested write. This timestamp would just contain a zero delta and serves
no purpose.

Now that we have a way to discard events, this patch will try to discard
the timestamp instead of just wasting the space in the ring buffer.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-03 10:15:22 -04:00
Tim Bird
a202355640 ring-buffer: fix bug in ring_buffer_discard_commit
There's a bug in ring_buffer_discard_commit.  The wrong
pointer is being compared in order to check if the event
can be freed from the buffer rather than discarded
(i.e. marked as PAD).

I noticed this when I was working on duration filtering.
The bug is not deadly - it just results in lots of wasted
space in the buffer.  All filtered events are left in
the buffer and marked as discarded, rather than being
removed from the buffer to make space for other events.

Unfortunately, when I fixed this bug, I got errors doing a
filtered function trace.  Multiple TIME_EXTEND
events pile up in the buffer, and trigger the
following loop overage warning in rb_iter_peek():

again:
	...
	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 10))
		return NULL;

I'm not sure what the best way is to fix this. I don't
know if I should extend the loop threshhold, or if I should
make the test more complex (ignore TIME_EXTEND
events), or just get rid of this loop check completely.

Note that if I implement a workaround for this, then I
see another problem from rb_advance_iter().  I haven't
tracked that one down yet.

In general, it seems like the case of removing filtered
events has not been working properly, and so some assumptions
about buffer invariant conditions need to be revisited.

Here's the patch for the simple fix:

Compare correct pointer for checking if an event can be
freed rather than left as discarded in the buffer.

Signed-off-by: Tim Bird <tim.bird@am.sony.com>
LKML-Reference: <4A25BE9E.5090909@am.sony.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-03 10:15:06 -04:00
Peter Zijlstra
a96bbc1641 perf_counter: Fix race in counter initialization
We need the PID namespace and counter ID available when the
counter overflows and we need to generate a sample event.

[ Impact: fix kernel crash with high-frequency sampling ]

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
[ fixed a further crash and cleaned up the initialization a bit ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-03 14:57:03 +02:00
Peter Zijlstra
226f62fdd5 perf_counter: Add a comm hook for pure fork()s
I noticed missing COMM events and found that we missed
reporting them for pure forks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-03 14:14:30 +02:00
Steven Rostedt
84047e360a function-graph: always initialize task ret_stack
On creating a new task while running the function graph tracer, if
we fail to allocate the ret_stack, and then fail the fork, the
code will free the parent ret_stack. This is because the child
duplicated the parent and currently points to the parent's ret_stack.

This patch always initializes the task's ret_stack to NULL.

[ Impact: prevent crash of parent on low memory during fork ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-02 16:51:55 -04:00
Steven Rostedt
f7e8b616ed function-graph: move initialization of new tasks up in fork
When the function graph tracer is enabled, all new tasks must allocate
a ret_stack to place the return address of functions. This is because
the function graph tracer will replace the real return address with a
call to the tracing of the exit function.

This initialization happens in fork, but it happens too late. If fork
fails, then it will call free_task and that calls the freeing of this
ret_stack. But before initialization happens, the new (failed) task
points to its parents ret_stack. If a fork failure happens during
the function trace, it would be catastrophic for the parent.

Also, there's no need to call ftrace_graph_exit_task from fork, since
it is called by free_task which fork calls on failure.

[ Impact: prevent crash during failed fork running function graph tracer ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-02 16:49:57 -04:00
Peter Zijlstra
0d48696f87 perf_counter: Rename perf_counter_hw_event => perf_counter_attr
The structure isn't hw only and when I read event, I think about those
things that fall out the other end. Rename the thing.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
Cc: Stephane Eranian <eranian@googlemail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-02 21:45:33 +02:00
Peter Zijlstra
08247e31ca perf_counter: Add ioctl for changing the sample period/frequency
Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-02 21:45:32 +02:00
Peter Zijlstra
8e3747c13c perf_counter: Change data head from u32 to u64
Since some people worried that 4G might not be a large enough
as an mmap data window, extend it to 64 bit for capable
platforms.

Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-02 21:45:32 +02:00
Peter Zijlstra
b23f3325ed perf_counter: Rename various fields
A few renames:

  s/irq_period/sample_period/
  s/irq_freq/sample_freq/
  s/PERF_RECORD_/PERF_SAMPLE_/
  s/record_type/sample_type/

And change both the new sample_type and read_format to u64.

Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-02 21:45:30 +02:00
Peter Zijlstra
8e5799b1ad perf_counter: Add unique counter id
Stephan raised the issue that we currently cannot distinguish between
similar counters within a group (PERF_RECORD_GROUP uses the config
value as identifier).

Therefore, generate a new ID for each counter using a global u64
sequence counter.

Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-02 21:45:29 +02:00
Steven Rostedt
26c01624a2 function-graph: add memory barriers for accessing task's ret_stack
The code that handles the tasks ret_stack allocation for every task
assumes that only an interrupt can cause issues (even though interrupts
are disabled).

In reality, the code is allocating the ret_stack for tasks that may be
running on other CPUs and there are not efficient memory barriers to
handle this case.

[ Impact: prevent crash due to using of uninitialized ret_stack variables ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-02 14:42:17 -04:00
Steven Rostedt
82310a3272 function-graph: enable the stack after initialization of other variables
The function graph tracer checks if the task_struct has ret_stack defined
to know if it is OK or not to use it. The initialization is done for
all tasks by one process, but the idle tasks use the same initialization
used by new tasks.

If an interrupt happens on an idle task that just had the ret_stack
created, but before the rest of the initialization took place, then
we can corrupt the return address of the functions.

This patch moves the setting of the task_struct's ret_stack to after
the other variables have been initialized.

[ Impact: prevent kernel panic on idle task when starting function graph ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-02 14:41:50 -04:00
Steven Rostedt
179c498ae2 function-graph: only allocate init tasks if it was not already done
When the function graph tracer is enabled, it calls the initialization
needed for the init tasks that would be called on all created tasks.

The problem is that this is called every time the function graph tracer
is enabled, and the ret_stack is allocated for the idle tasks each time.
Thus, the old ret_stack is lost and a memory leak is created.

This is also dangerous because if an interrupt happened on another CPU
with the init task and the ret_stack is replaced, we then lose all the
return pointers for the interrupt, and a crash would take place.

[ Impact: fix memory leak and possible crash due to race ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-02 12:03:19 -04:00
Peter Zijlstra
709e50cf87 perf_counter: Use PID namespaces properly
Stop using task_struct::pid and start using PID namespaces.

PIDs will be reported in the PID namespace of the monitoring
task at the moment of counter creation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-02 16:16:25 +02:00
Paul Mackerras
bf4e0ed3d0 perf_counter: Remove unused prev_state field
This removes the prev_state field of struct perf_counter since
it is now unused.  It was only used by the cpu migration
counter, which doesn't use it any more.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <18979.35052.915728.626374@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-02 13:10:55 +02:00
Paul Mackerras
3f731ca60a perf_counter: Fix cpu migration counter
This fixes the cpu migration software counter to count
correctly even when contexts get swapped from one task to
another.  Previously the cpu migration counts reported by perf
stat were bogus, ranging from negative to several thousand for
a single "lat_ctx 2 8 32" run.  With this patch the cpu
migration count reported for "lat_ctx 2 8 32" is almost always
between 35 and 44.

This fixes the problem by adding a call into the perf_counter
code from set_task_cpu when tasks are migrated.  This enables
us to use the generic swcounter code (with some modifications)
for the cpu migration counter.

This modifies the swcounter code to allow a NULL regs pointer
to be passed in to perf_swcounter_ctx_event() etc.  The cpu
migration counter does this because there isn't necessarily a
pt_regs struct for the task available.  In this case, the
counter will not have interrupt capability - but the migration
counter didn't have interrupt capability before, so this is no
loss.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <18979.35006.819769.416327@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-02 13:10:54 +02:00
Paul Mackerras
f38b082081 perf_counter: Initialize per-cpu context earlier on cpu up
This arranges for perf_counter's notifier for cpu hotplug
operations to be called earlier than the migration notifier in
sched.c by increasing its priority to 20, compared to the 10
for the migration notifier.  The reason for doing this is that
a subsequent commit to convert the cpu migration counter to use
the generic swcounter infrastructure will add a call into the
perf_counter subsystem when tasks get migrated.  Therefore the
perf_counter subsystem needs a chance to initialize its per-cpu
data for the new cpu before it can get called from the
migration code.

This also adds a comment to the migration notifier noting that
its priority needs to be lower than that of the perf_counter
notifier.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <18981.1900.792795.836858@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-02 13:10:54 +02:00
Steven Rostedt
0f6ce3de4e ftrace: do not profile functions when disabled
A race was found that if one were to enable and disable the function
profiler repeatedly, then the system can panic. This was because a profiled
function may be preempted just before disabling interrupts. While
the profiler is disabled and then reenabled, the preempted function
could start again, and access the hash as it is being initialized.

This just adds a check in the irq disabled part to check if the profiler
is enabled, and if it is not then it will just exit.

When the system is disabled, the profile_enabled variable is cleared
before calling the unregistering of the function profiler. This
unregistering calls stop machine which also acts as a synchronize schedule.

[ Impact: fix panic in enabling/disabling function profiler ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-01 23:26:23 -04:00
Steven Rostedt
112f38a7e3 tracing: make trace pipe recognize latency format flag
The trace_pipe did not recognize the latency format flag and would produce
different output than the trace file. The problem was partly due that
the trace flags in the iterator was not set as well as the trace_pipe
zeros out part of the iterator (including the flags) to be able to use
the same routines as the trace file. trace_flags of the iterator should
not cause any problems when not zeroed out by for trace_pipe.

Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-01 23:26:02 -04:00
Steven Whitehouse
ec081ddc3d tracing: add exports to use __print_symbolic and __print_flags from a module
A patch to allow the use of __print_symbolic and __print_flags
from a module. This allows the current GFS2 tracing patch to
build.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
LKML-Reference: <1243868015.29604.542.camel@localhost.localdomain>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-01 23:25:29 -04:00
Li Zefan
7fcb7c472f tracing/events: introduce __dynamic_array()
__string() is limited:

  - it's a char array, but we may want to define array with other types
  - a source string should be available, but we may just know the string size

We introduce __dynamic_array() to break those limitations, and __string()
becomes a wrapper of it. As a side effect, now __get_str() can be used
in TP_fast_assign but not only TP_print.

Take XFS for example, we have the string length in the dirent, but the
string itself is not NULL-terminated, so __dynamic_array() can be used:

TRACE_EVENT(xfs_dir2,
	TP_PROTO(struct xfs_da_args *args),
	TP_ARGS(args),

	TP_STRUCT__entry(
		__field(int, namelen)
		__dynamic_array(char, name, args->namelen + 1)
		...
	),

	TP_fast_assign(
		char *name = __get_str(name);

		if (args->namelen)
			memcpy(name, args->name, args->namelen);
		name[args->namelen] = '\0';

		__entry->namelen = args->namelen;
	),

	TP_printk("name %.*s namelen %d",
		  __entry->namelen ? __get_str(name) : NULL
		  __entry->namelen)
);

[ Impact: allow defining dynamic size arrays ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2384D2.3080403@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-01 23:25:15 -04:00
Steven Rostedt
897f17a653 tracing: combine the default tracers into one config
Both event tracer and sched switch plugin are selected by default
by all generic tracers. But if no generic tracer is enabled, their options
appear. But ether one of them will select the other, thus it only
makes sense to have the default tracers be selected by one option.

[ Impact: clean up kconfig menu ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-01 23:23:55 -04:00
Steven Rostedt
5e0a093910 tracing: fix config options to not show when automatically selected
There are two options that are selected by all tracers, but we want
to have those options available when no tracer is selected. These are

 The event tracer and sched switch tracer.

The are enabled by all tracers, but if a tracer is not selected we want
the options to appear. All tracers including them select TRACING.
Thus what we would like to do is:

  config EVENT_TRACER
	bool "prompt"
	depends on TRACING
	select TRACING

But that gives us a bug in the kbuild system since we just created a
circular dependency. We only want the prompt to show when TRACING is off.

This patch adds GENERIC_TRACER that all tracers will select instead of
TRACING. The two options (sched switch and event tracer) will select
TRACING directly and depend on !GENERIC_TRACER. This solves the cicular
dependency.

[ Impact: hide options that are selected by default ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-01 23:23:30 -04:00
Steven Rostedt
2af15d6a44 ftrace: add kernel command line function filtering
When using ftrace=function on the command line to trace functions
on boot up, one can not filter out functions that are commonly called.

This patch adds two new ftrace command line commands.

  ftrace_notrace=function-list
  ftrace_filter=function-list

Where function-list is a comma separated list of functions to filter.
The ftrace_notrace will make the functions listed not be included
in the function tracing, and ftrace_filter will only trace the functions
listed.

These two act the same as the debugfs/tracing/set_ftrace_notrace and
debugfs/tracing/set_ftrace_filter respectively.

The simple glob expressions that are allowed by the filter files can also
be used by the command line interface.

	ftrace_notrace=rcu*,*lock,*spin*

Will not trace any function that starts with rcu, ends with lock, or has
the word spin in it.

Note, if the self tests are enabled, they may interfere with the filtering
set by the command lines.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-01 23:23:10 -04:00
Frederic Weisbecker
43bd123623 tracing/stat: remove unappropriate safe walk on list
register_stat_tracer() uses list_for_each_entry_safe
to check whether a tracer is already present in the list.
But we don't delete anything from the list here, so
we don't need the safe version

[ Impact: cleanup list use is stat tracing ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-06-02 01:18:33 +02:00
Li Zefan
dbd3fbdfee tracing/stat: do some cleanups
- remove duplicate code in stat_seq_init()
- update comments to reflect the change from stat list to stat rbtree

[ Impact: clean up ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-06-02 01:18:18 +02:00
Li Zefan
e162280690 tracing/stat: remember to free root node
When closing a trace_stat file, we destroy the rbtree constructed during
file open, but there is memory leak that the root node is not freed.

[ Impact: fix memory leak when closing a trace_stat file ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-06-02 01:18:03 +02:00
Li Zefan
b3dd7ba7d8 tracing/stat: change dummpy_cmp() to return -1
Currently the output of trace_stat/workqueues is totally reversed:

 # cat /debug/tracing/trace_stat/workqueues
    ...
    1       17       17      210       37   `-blk_unplug_work+0x0/0x57
    1     3779     3779      181       11   |-cfq_kick_queue+0x0/0x2f
    1     3796     3796                     kblockd/1:120
    ...

The correct output should be:

    1     3796     3796                     kblockd/1:120
    1     3779     3779      181       11   |-cfq_kick_queue+0x0/0x2f
    1       17       17      210       37   `-blk_unplug_work+0x0/0x57

It's caused by "tracing/stat: replace linked list by an rbtree for
sorting"
(53059c9b67a62a3dc8c80204d3da42b9267ea5a0).

dummpy_cmp() should return -1, so rb_node will always be inserted as
right-most node in the rbtree, thus we sort the output in ascending
order.

[ Impact: fix the output of trace_stat/workqueues ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-06-02 01:17:49 +02:00
Frederic Weisbecker
8f184f2730 tracing/stat: replace linked list by an rbtree for sorting
When the stat tracing framework prepares the entries from a tracer
to output them to the user, it starts by computing a linear sort
through a linked list to give the entries ordered by relevance
to the user.

This is quite ugly and causes a small latency when we begin to
read the file.

This patch changes that by turning the linked list into a red-black
tree. Athough the whole iteration using the start and next tracer
callbacks while opening the file remain the same, it is now much
more fast and scalable.

The rbtree guarantees O(log(n)) insertions whereas a linked
list with linear sorting brought us a O(n) despair. Now the
(visible) latency has disapeared.

[ Impact: kill the latency while starting to read a stat tracer file ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-06-02 01:17:35 +02:00
Frederic Weisbecker
0d64f8342d tracing/stat: replace trace_stat_session by stat_session
The "trace" prefix in struct trace_stat_session type is annoying while
reading the trace_stat.c file. It makes the lines longer, and
is not that much useful to explain the sense of this type.

Just keep "struct stat_session" for this type.

[ Impact: make the code a bit more readable ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-06-02 01:17:17 +02:00
Zhaolei
f3c4ae26e9 trace_workqueue: remove blank line between each cpu
The blankline between each cpu's workqueue stat is not necessary, because
the cpu number is enough to part them by eye.
Old style also caused a blankline below headline, and made code complex
by using lock, disableirq and get cpu var.

Old style:
 # CPU  INSERTED  EXECUTED   NAME
 # |      |         |          |

   0   8644       8644       events/0
   0      0          0       cpuset
   ...
   0      1          1       kdmflush

   1  35365      35365       events/1
   ...

New style:
 # CPU  INSERTED  EXECUTED   NAME
 # |      |         |          |

   0   8644       8644       events/0
   0      0          0       cpuset
   ...
   0      1          1       kdmflush
   1  35365      35365       events/1
   ...

[ Impact: provide more readable code ]

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-06-02 01:14:26 +02:00
Zhaolei
b8867164f0 trace_workqueue: remove cpu_workqueue_stats->first_entry
cpu_workqueue_stats->first_entry is useless because we can retrieve the
header of a cpu workqueue using:
if (&cpu_workqueue_stats->list == workqueue_cpu_stat(cpu)->list.next)

[ Impact: cleanup ]

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-06-02 01:13:46 +02:00
Zhaolei
1fdfca9c57 trace_workqueue: use list_for_each_entry() instead of list_for_each_entry_safe()
No need to use list_for_each_entry_safe() in iteration without deleting
any node, we can use list_for_each_entry() instead.

[ Impact: cleanup ]

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-06-02 01:13:05 +02:00
Zhaolei
fb39125fd7 ftrace, workqueuetrace: make workqueue tracepoints use TRACE_EVENT macro
v3: zhaolei@cn.fujitsu.com: Change TRACE_EVENT definition to new format
    introduced by Steven Rostedt: consolidate trace and trace_event headers
v2: kosaki@jp.fujitsu.com: print the function names instead of addr, and zap
    the work addr
v1: zhaolei@cn.fujitsu.com: Make workqueue tracepoints use TRACE_EVENT macro

TRACE_EVENT is a more generic way to define tracepoints.
Doing so adds these new capabilities to the tracepoints:

  - zero-copy and per-cpu splice() tracing
  - binary tracing without printf overhead
  - structured logging records exposed under /debug/tracing/events
  - trace events embedded in function tracer output and other plugins
  - user-defined, per tracepoint filter expressions

Then, this patch converts DEFINE_TRACE to TRACE_EVENT in workqueue related
tracepoints.

[ Impact: expand workqueue tracer to events tracing ]

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-06-02 01:10:40 +02:00
H. Peter Anvin
48b1fddbb1 Merge branch 'irq/numa' into x86/mce3
Merge reason: arch/x86/kernel/irqinit_{32,64}.c unified in irq/numa
and modified in x86/mce3; this merge resolves the conflict.

Conflicts:
	arch/x86/kernel/irqinit.c

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-06-01 15:25:31 -07:00
Ingo Molnar
3d58f48ba0 Merge branch 'linus' into irq/numa
Conflicts:
	arch/mips/sibyte/bcm1480/irq.c
	arch/mips/sibyte/sb1250/irq.c

Merge reason: we gathered a few conflicts plus update to latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-01 21:06:21 +02:00
Ingo Molnar
22a4f650d6 perf_counter: Tidy up style details
- whitespace fixlets
 - make local variable definitions more consistent

[ Impact: cleanup ]

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-01 19:55:32 +02:00
Paul Mackerras
880ca15adf perf_counter: Allow software counters to count while task is not running
This changes perf_swcounter_match() so that per-task software
counters can count events that occur while their associated
task is not running.  This will allow us to use the generic
software counter code for counting task migrations, which can
occur while the task is not scheduled in.

To do this, we have to distinguish between the situations where
the counter is inactive because its task has been scheduled
out, and those where the counter is inactive because it is part
of a group that was not able to go on the PMU.  In the former
case we want the counter to count, but not in the latter case.
If the context is active, we have the latter case.  If the
context is inactive then we need to know whether the counter
was counting when the context was last active, which we can
determine by comparing its ->tstamp_stopped timestamp with the
context's timestamp.

This also folds three checks in perf_swcounter_match, checking
perf_event_raw(), perf_event_type() and perf_event_id()
individually, into a single 64-bit comparison on
counter->hw_event.config, as an optimization.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <18979.34810.259718.955621@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-01 10:04:06 +02:00
Paul Mackerras
25346b93ca perf_counter: Provide functions for locking and pinning the context for a task
This abstracts out the code for locking the context associated
with a task.  Because the context might get transferred from
one task to another concurrently, we have to check after
locking the context that it is still the right context for the
task and retry if not.  This was open-coded in
find_get_context() and perf_counter_init_task().

This adds a further function for pinning the context for a
task, i.e. marking it so it can't be transferred to another
task.  This adds a 'pin_count' field to struct
perf_counter_context to indicate that a context is pinned,
instead of the previous method of setting the parent_gen count
to all 1s.  Pinning the context with a pin_count is easier to
undo and doesn't require saving the parent_gen value.  This
also adds a perf_unpin_context() to undo the effect of
perf_pin_task_context() and changes perf_counter_init_task to
use it.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <18979.34748.755674.596386@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-01 10:04:05 +02:00
Ingo Molnar
23db9f430b Merge branch 'linus' into perfcounters/core
Merge reason: merge almost-rc8 into perfcounters/core, which was -rc6
              based - to pick up the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-01 10:01:39 +02:00
Peter Zijlstra
bbbee90829 perf_counter: Ammend cleanup in fork() fail
When fork() fails we cannot use perf_counter_exit_task() since that
assumes to operate on current. Write a new helper that cleans up
unused/clean contexts.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-29 16:21:52 +02:00
Peter Zijlstra
665c2142a9 perf_counter: Clean up task_ctx vs interrupts
Remove the local_irq_save() etc.. in routines that are smp function
calls, or have IRQs disabled by other means.

Then change the COMM, MMAP, and swcounter context iteration to
current->perf_counter_ctxp and RCU, since it really doesn't matter
which context they iterate, they're all folded.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-29 16:21:51 +02:00
Peter Zijlstra
efb3d17240 perf_counter: Fix COMM and MMAP events for cpu wide counters
Commit a63eaf34ae ("perf_counter: Dynamically allocate tasks'
perf_counter_context struct") broke COMM and MMAP notification for
cpu wide counters by dropping out early if there was no task context,
thereby also not iterating the cpu context.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-29 16:21:51 +02:00
Ingo Molnar
012b84dae1 perf_counter: Robustify counter-free logic
This fixes a nasty crash and highlights a bug that we were
freeing failed-fork() counters incorrectly.

(the fix for that will come separately)

[ Impact: fix crashes/lockups with inherited counters ]

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-29 14:28:37 +02:00
Ingo Molnar
3f4dee2273 perf_counter: Fix cpuctx->task_ctx races
Peter noticed that we are sometimes reading cpuctx->task_ctx with
interrupts enabled.

Noticed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-29 14:28:36 +02:00
Paul Mackerras
ad3a37de81 perf_counter: Don't swap contexts containing locked mutex
Peter Zijlstra pointed out that under some circumstances, we can take
the mutex in a context or a counter and then swap that context or
counter to another task, potentially leading to lock order inversions
or the mutexes not protecting what they are supposed to protect.

This fixes the problem by making sure that we never take a mutex in a
context or counter which could get swapped to another task.  Most of
the cases where we take a mutex is on a top-level counter or context,
i.e. a counter which has an fd associated with it or a context that
contains such a counter.  This adds WARN_ON_ONCE statements to verify
that.

The two cases where we need to take the mutex on a context that is a
clone of another are in perf_counter_exit_task and
perf_counter_init_task.  The perf_counter_exit_task case is solved by
uncloning the context before starting to remove the counters from it.
The perf_counter_init_task is a little trickier; we temporarily
disable context swapping for the parent (forking) task by setting its
ctx->parent_gen to the all-1s value after locking the context, if it
is a cloned context, and restore the ctx->parent_gen value at the end
if the context didn't get uncloned in the meantime.

This also moves the increment of the context generation count to be
within the same critical section, protected by the context mutex, that
adds the new counter to the context.  That way, taking the mutex is
sufficient to ensure that both the counter list and the generation
count are stable.

[ Impact: fix hangs, races with inherited and PID counters ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <18975.31580.520676.619896@drongo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-29 11:02:46 +02:00
Andi Kleen
a9862e0560 Export add_timer_on for modules
Needed in followon patch.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-05-28 09:24:13 -07:00
Paul Mackerras
c93f766909 perf_counter: Fix race in attaching counters to tasks and exiting
Commit 564c2b21 ("perf_counter: Optimize context switch between
identical inherited contexts") introduced a race where it is possible
that a counter being attached to a task could get attached to the
wrong task, if the task is one that has inherited its context from
another task via fork.  This happens because the optimized context
switch could switch the context to another task after find_get_context
has read task->perf_counter_ctxp.  In fact, it's possible that the
context could then get freed, if the other task then exits.

This fixes the problem by protecting both the context switch and the
critical code in find_get_context with spinlocks.  The context switch
locks the cxt->lock of both the outgoing and incoming contexts before
swapping them.  That means that once code such as find_get_context
has obtained the spinlock for the context associated with a task,
the context can't get swapped to another task.  However, the context
may have been swapped in the interval between reading
task->perf_counter_ctxp and getting the lock, so it is necessary to
check and retry.

To make sure that none of the contexts being looked at in
find_get_context can get freed, this changes the context freeing code
to use RCU.  Thus an rcu_read_lock() is sufficient to ensure that no
contexts can get freed.  This part of the patch is lifted from a patch
posted by Peter Zijlstra.

This also adds a check to make sure that we can't add a counter to a
task that is exiting.

There is also a race between perf_counter_exit_task and
find_get_context; this solves the race by moving the get_ctx that
was in perf_counter_alloc into the locked region in find_get_context,
so that once find_get_context has got the context for a task, it
won't get freed even if the task calls perf_counter_exit_task.  It
doesn't matter if new top-level (non-inherited) counters get attached
to the context after perf_counter_exit_task has detached the context
from the task.  They will just stay there and never get scheduled in
until the counters' fds get closed, and then perf_release will remove
them from the context and eventually free the context.

With this, we are now doing the unclone in find_get_context rather
than when a counter was added to or removed from a context (actually,
we were missing the unclone_ctx() call when adding a counter to a
context).  We don't need to unclone when removing a counter from a
context because we have no way to remove a counter from a cloned
context.

This also takes out the smp_wmb() in find_get_context, which Peter
Zijlstra pointed out was unnecessary because the cmpxchg implies a
full barrier anyway.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <18974.33033.667187.273886@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-28 15:03:50 +02:00
Heiko Carstens
5b6045a906 trace: disable preemption before taking raw spinlocks
s390 code uses smp_processor_id() in __raw_spin_lock() code which
reveals that a (raw) spinlock is taken without preemption disabled.
This can potentially deadlock.

To fix this explicitly disable and enable preemption.

BUG: using smp_processor_id() in preemptible [00000000] code: cat/2278
caller is trace_find_cmdline+0x40/0xfc
CPU: 0 Not tainted 2.6.30-rc7-dirty #39
Process cat (pid: 2278, task: 000000003faedb68, ksp: 000000003b33b988)
000000003b33b988 000000003b33bae0 0000000000000002 0000000000000000
       000000003b33bb80 000000003b33baf8 000000003b33baf8 00000000000175d6
       0000000000000001 000000003b33b988 000000003f9b0000 000000000000000b
       000000000000000c 000000003b33bb40 000000003b33bae0 0000000000000000
       0000000000000000 00000000000175d6 000000003b33bae0 000000003b33bb28
Call Trace:
([<00000000000174b2>] show_trace+0x112/0x170)
 [<0000000000017582>] show_stack+0x72/0x100
 [<0000000000441538>] dump_stack+0xc8/0xd8
 [<000000000025c350>] debug_smp_processor_id+0x114/0x130
 [<00000000000bf0e4>] trace_find_cmdline+0x40/0xfc
 [<00000000000c35d4>] trace_print_context+0x58/0xac
 [<00000000000bb676>] print_trace_line+0x416/0x470
 [<00000000000bc8fe>] s_show+0x4e/0x428
 [<000000000013834e>] seq_read+0x36a/0x5d4
 [<0000000000112a78>] vfs_read+0xc8/0x174
 [<0000000000112c58>] SyS_read+0x74/0xc4
 [<000000000002c7ae>] sysc_noemu+0x10/0x16
 [<000002000012436c>] 0x2000012436c
1 lock held by cat/2278:
 #0:  (&p->lock){+.+.+.}, at: [<0000000000138056>] seq_read+0x72/0x5d4

[ Impact: fix preempt-unsafe raw spinlock ]

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-05-28 01:21:03 +02:00
Eero Nurkkala
f2e21c9610 NOHZ: Properly feed cpufreq ondemand governor
A call from irq_exit() may occasionally pause the timing
info for cpufreq ondemand governor. This results in the
cpufreq ondemand governor to fail to calculate the 
system load properly. Thus, relocate the checks for this
particular case to keep the governor always functional.

Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
Reported-by: Tero Kristo <tero.kristo@nokia.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-27 15:33:43 +02:00
Tetsuo Handa
ab2b7ebaad kmod: Release sub_info on cred allocation failure.
call_usermodehelper_setup() forgot to kfree(sub_info)
when prepare_usermodehelper_creds() failed.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-26 12:11:19 -07:00
Steven Rostedt
0f4fc29dd6 tracing: add __print_symbolic to trace events
This patch adds __print_symbolic which is similar to __print_flags but
works for an enumeration type instead. That is, there is only a one to one
mapping between the values and the symbols. When a match is made, then
it is printed, otherwise the hex value is outputed.

[ Impact: add interface for showing symbol names in events ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-05-26 20:31:50 +02:00
Steven Rostedt
be74b73a57 tracing: add __print_flags for events
Developers have been asking for the ability in the ftrace event tracer
to display names of bits in a flags variable.

Instead of printing out c2, it would be easier to read FOO|BAR|GOO,
assuming that FOO is bit 1, BAR is bit 6 and GOO is bit 7.

Some examples where this would be useful are the state flags in a context
switch, kmalloc flags, and even permision flags in accessing files.

[
  v2 changes include:

  Frederic Weisbecker's idea of using a mask instead of bits,
  thus we can output GFP_KERNEL instead of GPF_WAIT|GFP_IO|GFP_FS.

  Li Zefan's idea of allowing the caller of __print_flags to add their
  own delimiter (or no delimiter) where we can get for file permissions
  rwx instead of r|w|x.
]

[
  v3 changes:

   Christoph Hellwig's idea of using an array instead of va_args.
]

[ Impact: better displaying of flags in trace output ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-05-26 20:25:22 +02:00
Ingo Molnar
329d876d6f perf_counter: Initialize ->oncpu properly
This shouldnt matter normally (and i have not seen any
misbehavior), because active counters always have a
proper ->oncpu value - but nevertheless initialize the
field properly to -1.

[ Impact: cleanup ]

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-26 09:54:13 +02:00
Zhaolei
0e907c9939 ftrace: clean up of using ftrace_event_enable_disable()
Always use ftrace_event_enable_disable() to enable/disable an event
so that we can factorize out the event toggling code.

[ Impact: factorize and cleanup event tracing code ]

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <4A14FDFE.2080402@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-05-26 03:30:31 +02:00
Zhaolei
b11c53e12f ftrace: Add task_comm support for trace_event
If we enable a trace event alone without any tracer running (such as
function tracer, sched switch tracer, etc...) it can't output enough
task command information.

We need to use the tracing_{start/stop}_cmdline_record() helpers
which are designed to keep track of cmdlines for any tasks that
were scheduled during the tracing.

Before this patch:
 # echo 1 > debugfs/tracing/events/sched/sched_switch/enable
 # cat debugfs/tracing/trace
 # tracer: nop
 #
 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
 #              | |       |          |         |
            <...>-2289  [000] 526276.724790: sched_switch: task bash:2289 [120] ==> sshd:2287 [120]
            <...>-2287  [000] 526276.725231: sched_switch: task sshd:2287 [120] ==> bash:2289 [120]
            <...>-2289  [000] 526276.725452: sched_switch: task bash:2289 [120] ==> sshd:2287 [120]
            <...>-2287  [000] 526276.727181: sched_switch: task sshd:2287 [120] ==> swapper:0 [140]
           <idle>-0     [000] 526277.032734: sched_switch: task swapper:0 [140] ==> events/0:5 [115]
            <...>-5     [000] 526277.032782: sched_switch: task events/0:5 [115] ==> swapper:0 [140]
 ...

After this patch:
 # tracer: nop
 #
 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
 #              | |       |          |         |
             bash-2269  [000] 527347.989229: sched_switch: task bash:2269 [120] ==> sshd:2267 [120]
             sshd-2267  [000] 527347.990960: sched_switch: task sshd:2267 [120] ==> bash:2269 [120]
             bash-2269  [000] 527347.991143: sched_switch: task bash:2269 [120] ==> sshd:2267 [120]
             sshd-2267  [000] 527347.992959: sched_switch: task sshd:2267 [120] ==> swapper:0 [140]
           <idle>-0     [000] 527348.531989: sched_switch: task swapper:0 [140] ==> events/0:5 [115]
         events/0-5     [000] 527348.532115: sched_switch: task events/0:5 [115] ==> swapper:0 [140]
 ...

Changelog:
v1->v2: Update Kconfig to select CONTEXT_SWITCH_TRACER in
        ENABLE_EVENT_TRACING
v2->v3: v2 can solve problem that was caused by config EVENT_TRACING
        alone, but when CONFIG_FTRACE is off and CONFIG_TRACING is
        selected by other config, compile fail happened again.
        This version solves it.

[ Impact: fix incomplete output of event tracing ]

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4A14FDFE.2080402@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-05-26 03:03:21 +02:00
Lai Jiangshan
4f5359685a tracing: add trace_event_read_lock()
I found that there is nothing to protect event_hash in
ftrace_find_event(). Rcu protects the event hashlist
but not the event itself while we use it after its extraction
through ftrace_find_event().

This lack of a proper locking in this spot opens a race
window between any event dereferencing and module removal.

Eg:

--Task A--

print_trace_line(trace) {
  event = find_ftrace_event(trace)

--Task B--

trace_module_remove_events(mod) {
  list_trace_events_module(ev, mod) {
    unregister_ftrace_event(ev->event) {
      hlist_del(ev->event->node)
        list_del(....)
    }
  }
}
|--> module removed, the event has been dropped

--Task A--

  event->print(trace); // Dereferencing freed memory

If the event retrieved belongs to a module and this module
is concurrently removed, we may end up dereferencing a data
from a freed module.

RCU could solve this, but it would add latency to the kernel and
forbid tracers output callbacks to call any sleepable code.
So this fix converts 'trace_event_mutex' to a read/write semaphore,
and adds trace_event_read_lock() to protect ftrace_find_event().

[ Impact: fix possible freed memory dereference in ftrace ]

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4A114806.7090302@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-05-25 23:53:41 +02:00
Ingo Molnar
0127c3ea08 perf_counter: fix warning & lockup
- remove bogus warning
 - fix wakeup from NMI path lockup
 - also fix up whitespace noise in perf_counter.h

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525153931.703093461@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-25 22:02:23 +02:00
Peter Zijlstra
a78ac32587 perf_counter: Generic per counter interrupt throttle
Introduce a generic per counter interrupt throttle.

This uses the perf_counter_overflow() quick disable to throttle a specific
counter when its going too fast when a pmu->unthrottle() method is provided
which can undo the quick disable.

Power needs to implement both the quick disable and the unthrottle method.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525153931.703093461@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-25 21:41:12 +02:00
Peter Zijlstra
10989fb245 perf_counter: Fix PERF_COUNTER_CONTEXT_SWITCHES for cpu counters
Ingo noticed that cpu counters had 0 context switches, even though
there was plenty scheduling on the cpu.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525124600.419025548@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-25 14:55:01 +02:00
Peter Zijlstra
6ab423e0ea perf_counter: Propagate inheritance failures down the fork() path
Fail fork() when we fail inheritance for some reason (-ENOMEM most likely).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525124600.324656474@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-25 14:55:01 +02:00
Peter Zijlstra
771d7cde14 perf_counter: Make pctrl() affect inherited counters too
Paul noted that the new ptcrl() didn't work on child counters.

Reported-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090525124600.203151469@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-25 14:55:00 +02:00
Ingo Molnar
e4cbb4e3ac perf_counter: Move child perfcounter init to after scheduler init
Initialize a task's perfcounters (inherit from parent, etc.) after
the child task's scheduler fields have been initialized already.

[ Impact: cleanup ]

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-25 13:05:06 +02:00
Linus Torvalds
93c3248380 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM: Do not hold dpm_list_mtx while disabling/enabling nonboot CPUs
2009-05-24 19:38:25 -07:00
James Bottomley
d5a877e8dd async: make sure independent async domains can't accidentally entangle
The problem occurs when async_synchronize_full_domain() is called when
the async_pending list is not empty.  This will cause lowest_running()
to return the cookie of the first entry on the async_pending list, which
might be nothing at all to do with the domain being asked for and thus
cause the domain synchronization to wait for an unrelated domain.   This
can cause a deadlock if domain synchronization is used from one domain
to wait for another.

Fix by running over the async_pending list to see if any pending items
actually belong to our domain (and return their cookies if they do).

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-24 13:38:41 -07:00
Rafael J. Wysocki
32bdfac546 PM: Do not hold dpm_list_mtx while disabling/enabling nonboot CPUs
We shouldn't hold dpm_list_mtx while executing
[disable|enable]_nonboot_cpus(), because theoretically this may lead
to a deadlock as shown by the following example (provided by Johannes
Berg):

CPU 3       CPU 2                     CPU 1
                                      suspend/hibernate
            something:
            rtnl_lock()               device_pm_lock()
                                       -> mutex_lock(&dpm_list_mtx)

            mutex_lock(&dpm_list_mtx)

linkwatch_work
 -> rtnl_lock()
                                      disable_nonboot_cpus()
                                       -> flush CPU 3 workqueue

Fortunately, device drivers are supposed to stop any activities that
might lead to the registration of new device objects way before
disable_nonboot_cpus() is called, so it shouldn't be necessary to
hold dpm_list_mtx over the entire late part of device suspend and
early part of device resume.

Thus, during the late suspend and the early resume of devices acquire
dpm_list_mtx only when dpm_list is going to be traversed and release
it right after that.

This patch is reported to fix the regressions tracked as
http://bugzilla.kernel.org/show_bug.cgi?id=13245.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Miles Lane <miles.lane@gmail.com>
Tested-by: Ming Lei <tom.leiming@gmail.com>
2009-05-24 21:15:07 +02:00
Ingo Molnar
a3862d3f81 perf_counter: Increase mmap limit
In a default 'perf top' run the tool will create a counter for
each online CPU. With enough CPUs this will eventually exhaust
the default limit.

So scale it up with the number of online CPUs.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-24 09:02:37 +02:00
Peter Zijlstra
475c557973 perf_counter: Remove perf_counter_context::nr_enabled
now that pctrl() no longer disables other people's counters,
remove the PMU cache code that deals with that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090523163013.032998331@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-24 08:24:30 +02:00
Peter Zijlstra
082ff5a276 perf_counter: Change pctrl() behaviour
Instead of en/dis-abling all counters acting on a particular
task, en/dis- able all counters we created.

[ v2: fix crash on first counter enable ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090523163012.916937244@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-24 08:24:08 +02:00
Peter Zijlstra
aa9c67f53d perf_counter: Simplify context cleanup
Use perf_counter_remove_from_context() to remove counters from
the context.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090523163012.796275849@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-23 19:37:47 +02:00
Peter Zijlstra
682076ae1d perf_counter: Sanitize context locking
Ensure we're consistent with the context locks.

 context->mutex
   context->lock
     list_{add,del}_counter();

so that either lock is sufficient to stabilize the context.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090523163012.618790733@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-23 19:37:46 +02:00
Peter Zijlstra
fccc714b31 perf_counter: Sanitize counter->mutex
s/counter->mutex/counter->child_mutex/ and make sure its only
used to protect child_list.

The usage in __perf_counter_exit_task() doesn't appear to be
problematic since ctx->mutex also covers anything related to fd
tear-down.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090523163012.533186528@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-23 19:37:45 +02:00
Peter Zijlstra
e220d2dcb9 perf_counter: Fix dynamic irq_period logging
We call perf_adjust_freq() from perf_counter_task_tick() which
is is called under the rq->lock causing lock recursion.
However, it's no longer required to be called under the
rq->lock, so remove it from under it.

Also, fix up some related comments.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090523163012.476197912@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-23 19:37:44 +02:00
Paul Mundt
948cd52906 sparseirq: Allow early irq_desc allocation
Presently non-legacy IRQs have their irq_desc allocated with
kzalloc_node(). This assumes that all callers of irq_to_desc_node_alloc()
will be sufficiently late in the boot process that kmalloc is available.

While porting sparseirq support to sh this blew up immediately, as at the
time that we register the CPU's interrupt vector map only bootmem is
available. Check slab_is_available() to work out which path to use.

[ Impact: fix SH early boot crash with sparseirq enabled ]

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
LKML-Reference: <20090522014008.GA2806@linux-sh.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-23 14:55:24 +02:00
Jens Axboe
9bd7de51ee Merge branch 'master' into for-2.6.31
Conflicts:
	drivers/ide/ide-io.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 20:28:35 +02:00
Jens Axboe
e4b636366c Merge branch 'master' into for-2.6.31
Conflicts:
	drivers/block/hd.c
	drivers/block/mg_disk.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 20:25:34 +02:00
Paul Mackerras
564c2b210a perf_counter: Optimize context switch between identical inherited contexts
When monitoring a process and its descendants with a set of inherited
counters, we can often get the situation in a context switch where
both the old (outgoing) and new (incoming) process have the same set
of counters, and their values are ultimately going to be added together.
In that situation it doesn't matter which set of counters are used to
count the activity for the new process, so there is really no need to
go through the process of reading the hardware counters and updating
the old task's counters and then setting up the PMU for the new task.

This optimizes the context switch in this situation.  Instead of
scheduling out the perf_counter_context for the old task and
scheduling in the new context, we simply transfer the old context
to the new task and keep using it without interruption.  The new
context gets transferred to the old task.  This means that both
tasks still have a valid perf_counter_context, so no special case
is introduced when the old task gets scheduled in again, either on
this CPU or another CPU.

The equivalence of contexts is detected by keeping a pointer in
each cloned context pointing to the context it was cloned from.
To cope with the situation where a context is changed by adding
or removing counters after it has been cloned, we also keep a
generation number on each context which is incremented every time
a context is changed.  When a context is cloned we take a copy
of the parent's generation number, and two cloned contexts are
equivalent only if they have the same parent and the same
generation number.  In order that the parent context pointer
remains valid (and is not reused), we increment the parent
context's reference count for each context cloned from it.

Since we don't have individual fds for the counters in a cloned
context, the only thing that can make two clones of a given parent
different after they have been cloned is enabling or disabling all
counters with prctl.  To account for this, we keep a count of the
number of enabled counters in each context.  Two contexts must have
the same number of enabled counters to be considered equivalent.

Here are some measurements of the context switch time as measured with
the lat_ctx benchmark from lmbench, comparing the times obtained with
and without this patch series:

		-----Unmodified-----		With this patch series
Counters:	none	2 HW	4H+4S	none	2 HW	4H+4S

2 processes:
Average		3.44	6.45	11.24	3.12	3.39	3.60
St dev		0.04	0.04	0.13	0.05	0.17	0.19

8 processes:
Average		6.45	8.79	14.00	5.57	6.23	7.57
St dev		1.27	1.04	0.88	1.42	1.46	1.42

32 processes:
Average		5.56	8.43	13.78	5.28	5.55	7.15
St dev		0.41	0.47	0.53	0.54	0.57	0.81

The numbers are the mean and standard deviation of 20 runs of
lat_ctx.  The "none" columns are lat_ctx run directly without any
counters.  The "2 HW" columns are with lat_ctx run under perfstat,
counting cycles and instructions.  The "4H+4S" columns are lat_ctx run
under perfstat with 4 hardware counters and 4 software counters
(cycles, instructions, cache references, cache misses, task
clock, context switch, cpu migrations, and page faults).

[ Impact: performance optimization of counter context-switches ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <18966.10666.517218.332164@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-22 12:18:20 +02:00
Paul Mackerras
a63eaf34ae perf_counter: Dynamically allocate tasks' perf_counter_context struct
This replaces the struct perf_counter_context in the task_struct with
a pointer to a dynamically allocated perf_counter_context struct.  The
main reason for doing is this is to allow us to transfer a
perf_counter_context from one task to another when we do lazy PMU
switching in a later patch.

This has a few side-benefits: the task_struct becomes a little smaller,
we save some memory because only tasks that have perf_counters attached
get a perf_counter_context allocated for them, and we can remove the
inclusion of <linux/perf_counter.h> in sched.h, meaning that we don't
end up recompiling nearly everything whenever perf_counter.h changes.

The perf_counter_context structures are reference-counted and freed
when the last reference is dropped.  A context can have references
from its task and the counters on its task.  Counters can outlive the
task so it is possible that a context will be freed well after its
task has exited.

Contexts are allocated on fork if the parent had a context, or
otherwise the first time that a per-task counter is created on a task.
In the latter case, we set the context pointer in the task struct
locklessly using an atomic compare-and-exchange operation in case we
raced with some other task in creating a context for the subject task.

This also removes the task pointer from the perf_counter struct.  The
task pointer was not used anywhere and would make it harder to move a
context from one task to another.  Anything that needed to know which
task a counter was attached to was already using counter->ctx->task.

The __perf_counter_init_context function moves up in perf_counter.c
so that it can be called from find_get_context, and now initializes
the refcount, but is otherwise unchanged.

We were potentially calling list_del_counter twice: once from
__perf_counter_exit_task when the task exits and once from
__perf_counter_remove_from_context when the counter's fd gets closed.
This adds a check in list_del_counter so it doesn't do anything if
the counter has already been removed from the lists.

Since perf_counter_task_sched_in doesn't do anything if the task doesn't
have a context, and leaves cpuctx->task_ctx = NULL, this adds code to
__perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in
the case where the current task adds the first counter to itself and
thus creates a context for itself.

This also adds similar code to __perf_counter_enable to handle a
similar situation which can arise when the counters have been disabled
using prctl; that also leaves cpuctx->task_ctx = NULL.

[ Impact: refactor counter context management to prepare for new feature ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <18966.10075.781053.231153@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-22 12:18:19 +02:00
James Morris
2c9e703c61 Merge branch 'master' into next
Conflicts:
	fs/exec.c

Removed IMA changes (the IMA checks are now performed via may_open()).

Signed-off-by: James Morris <jmorris@namei.org>
2009-05-22 18:40:59 +10:00
Paul Mundt
5f8371cec9 Merge branches 'sh/stable-updates' and 'sh/sparseirq' 2009-05-22 13:29:37 +09:00
Ingo Molnar
34adc80622 perf_counter: Fix context removal deadlock
Disable the PMU globally before removing a counter from a
context. This fixes the following lockup:

[22081.741922] ------------[ cut here ]------------
[22081.746668] WARNING: at arch/x86/kernel/cpu/perf_counter.c:803 intel_pmu_handle_irq+0x9b/0x24e()
[22081.755624] Hardware name: X8DTN
[22081.758903] perfcounters: irq loop stuck!
[22081.762985] Modules linked in:
[22081.766136] Pid: 11082, comm: perf Not tainted 2.6.30-rc6-tip #226
[22081.772432] Call Trace:
[22081.774940]  <NMI>  [<ffffffff81019aed>] ? intel_pmu_handle_irq+0x9b/0x24e
[22081.781993]  [<ffffffff81019aed>] ? intel_pmu_handle_irq+0x9b/0x24e
[22081.788368]  [<ffffffff8104505c>] ? warn_slowpath_common+0x77/0xa3
[22081.794649]  [<ffffffff810450d3>] ? warn_slowpath_fmt+0x40/0x45
[22081.800696]  [<ffffffff81019aed>] ? intel_pmu_handle_irq+0x9b/0x24e
[22081.807080]  [<ffffffff814d1a72>] ? perf_counter_nmi_handler+0x3f/0x4a
[22081.813751]  [<ffffffff814d2d09>] ? notifier_call_chain+0x58/0x86
[22081.819951]  [<ffffffff8105b250>] ? notify_die+0x2d/0x32
[22081.825392]  [<ffffffff814d1414>] ? do_nmi+0x8e/0x242
[22081.830538]  [<ffffffff814d0f0a>] ? nmi+0x1a/0x20
[22081.835342]  [<ffffffff8117e102>] ? selinux_file_free_security+0x0/0x1a
[22081.842105]  [<ffffffff81018793>] ? x86_pmu_disable_counter+0x15/0x41
[22081.848673]  <<EOE>>  [<ffffffff81018f3d>] ? x86_pmu_disable+0x86/0x103
[22081.855512]  [<ffffffff8108fedd>] ? __perf_counter_remove_from_context+0x0/0xfe
[22081.862926]  [<ffffffff8108fcbc>] ? counter_sched_out+0x30/0xce
[22081.868909]  [<ffffffff8108ff36>] ? __perf_counter_remove_from_context+0x59/0xfe
[22081.876382]  [<ffffffff8106808a>] ? smp_call_function_single+0x6c/0xe6
[22081.882955]  [<ffffffff81091b96>] ? perf_release+0x86/0x14c
[22081.888600]  [<ffffffff810c4c84>] ? __fput+0xe7/0x195
[22081.893718]  [<ffffffff810c213e>] ? filp_close+0x5b/0x62
[22081.899107]  [<ffffffff81046a70>] ? put_files_struct+0x64/0xc2
[22081.905031]  [<ffffffff8104841a>] ? do_exit+0x1e2/0x6ef
[22081.910360]  [<ffffffff814d0a60>] ? _spin_lock_irqsave+0x9/0xe
[22081.916292]  [<ffffffff8104898e>] ? do_group_exit+0x67/0x93
[22081.921953]  [<ffffffff810489cc>] ? sys_exit_group+0x12/0x16
[22081.927759]  [<ffffffff8100baab>] ? system_call_fastpath+0x16/0x1b
[22081.934076] ---[ end trace 3a3936ce3e1b4505 ]---

And could potentially also fix the lockup reported by Marcelo Tosatti.

Also, print more debug info in case of a detected lockup.

[ Impact: fix lockup ]

Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-20 20:12:54 +02:00
Ming Lei
5537937696 ftrace: fix check for return value of register_module_notifier in event_trace_init
register_module_notifier() returns zero in the success case.
So fix the inverted fail case check in trace events modules
handler.

[ Impact: fix spurious warning on ftrace initialization]

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-05-20 19:23:11 +02:00
Peter Zijlstra
afedadf23a perf_counter: Optimize sched in/out of counters
Avoid a function call for !group counters by directly calling the counter
function.

[ Impact: micro-optimize the code ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090520102553.511933670@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-20 12:43:34 +02:00
Peter Zijlstra
b986d7ec0f perf_counter: Optimize disable of time based sw counters
Currently we call hrtimer_cancel() unconditionally on disable of time based
software counters. Avoid when possible.

[ Impact: micro-optimize the code ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090520102553.388185031@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-20 12:43:33 +02:00
Peter Zijlstra
26b119bc81 perf_counter: Log irq_period changes
For the dynamic irq_period code, log whenever we change the period so that
analyzing code can normalize the event flow.

[ Impact: add new feature to allow more precise profiling ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090520102553.298769743@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-20 12:43:33 +02:00
Peter Zijlstra
d7b629a34f perf_counter: Solve the rotate_ctx vs inherit race differently
Instead of disabling RR scheduling of the counters, use a different list
that does not get rotated to iterate the counters on inheritance.

[ Impact: cleanup, optimization ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090520102553.237504544@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-20 12:43:32 +02:00
Thomas Gleixner
2070887fde futex: fix restart in wait_requeue_pi
If the waiter has been requeued to the outer PI futex and is
interrupted by a signal and the thread handles the signal then
ERESTART_RESTARTBLOCK is changed to EINTR and the restart block is
discarded. That way we return an unexcpected EINTR to user space
instead of ending up in futex_lock_pi_restart.

But we do not need to restart the syscall because we know that the
condition has changed since we have been requeued. If we would simply
restart the syscall then we would drop out via the comparison of the
user space value with EWOULDBLOCK.

The user space side needs to handle EWOULDBLOCK anyway as the
enqueueing on the inner futex can race with a requeue/wake. So we can
simply return EWOULDBLOCK to user space which also signals that we did
not take the outer futex and let user space handle it in the same way
it has to handle the requeue/wake race.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-20 10:34:32 +02:00
Thomas Gleixner
1c840c1490 futex: fix restart for early wakeup in futex_wait_requeue_pi()
The futex_wait_requeue_pi op should restart unconditionally like
futex_lock_pi. The user of that function e.g. pthread_cond_wait can
not be interrupted so we do not care about the SA_RESTART flag of the
signal. Clean up the FIXMEs.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-20 10:28:45 +02:00
Thomas Gleixner
c8b15a706d futex: cleanup error exit
Reuse the put_key_ref(key2) call in the exit path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-20 10:28:45 +02:00
Thomas Gleixner
521c180874 Merge branch 'core/urgent' into core/futexes
Merge reason: this branch was on an pre -rc1 base, merge it up to -rc6+
              to get the latest upstream fixes.

Conflicts:
	kernel/futex.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-20 09:02:28 +02:00
Ingo Molnar
c44d70a340 perf_counter: fix counter inheritance race
Context rotation should not occur when we are in the middle of
walking the counter list when inheriting counters ...

[ Impact: fix occasionally incorrect perf stat results ]

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-20 00:22:30 +02:00
Ingo Molnar
33b2fb303f perf_counter: fix counter freeing logic
Fix counter lifetime bugs which explain the crashes reported by
Marcelo Tosatti and Arnaldo Carvalho de Melo.

The new rule is: flushing + freeing is only done for a task's
own counters, never for other tasks.

[ Impact: fix crashes/lockups with inherited counters ]

Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-20 00:22:24 +02:00
Thomas Gleixner
64d1304a64 futex: setup writeable mapping for futex ops which modify user space data
The futex code installs a read only mapping via get_user_pages_fast()
even if the futex op function has to modify user space data. The
eventual fault was fixed up by futex_handle_fault() which walked the
VMA with mmap_sem held.

After the cleanup patches which removed the mmap_sem dependency of the
futex code commit 4dc5b7a36a49eff97050894cf1b3a9a02523717 (futex:
clean up fault logic) removed the private VMA walk logic from the
futex code. This change results in a stale RO mapping which is not
fixed up.

Instead of reintroducing the previous fault logic we set up the
mapping in get_user_pages_fast() read/write for all operations which
modify user space data. Also handle private futexes in the same way
and make the current unconditional access_ok(VERIFY_WRITE) depend on
the futex op.

Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
CC: stable@kernel.org
2009-05-19 23:36:52 +02:00
Stefan Raspl
fd51d251e4 blktrace: remove debugfs entries on bad path
debugfs directory entries for devices are not removed on some
of the failure pathes in do_blk_trace_setup().
One way to reproduce is to start blktrace on multiple devices
with insufficient Vmalloc space: Devices will fail with
a message like this:

	BLKTRACESETUP(2) /dev/sdu failed: 5/Input/output error

If so, the respective entries in debugfs
(e.g. /sys/kernel/debug/block/sdu) will remain and subsequent
attempts to start blktrace on the respective devices will not
succeed due to existing directories.

[ Impact: fix /debug/tracing file cleanup corner case ]

Signed-off-by: Stefan Raspl <stefan.raspl@linux.vnet.ibm.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
LKML-Reference: <4A1266CC.5040801@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-19 10:29:21 +02:00
Ingo Molnar
4200efd9ac sched: properly define the sched_group::cpumask and sched_domain::span fields
Properly document the variable-size structure tricks we are doing
wrt. struct sched_group and sched_domain, and use the field[0] GCC
extension instead of defining a vla array.

Dont use unions for this, as pointed out by Linus.

[ Impact: cleanup, un-confuse Sparse and LLVM ]

Reported-by: Jeff Garzik <jeff@garzik.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <alpine.LFD.2.01.0905180850110.3301@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-19 09:22:19 +02:00
Linus Torvalds
ee3af6ee77 Merge branches 'sched-fixes-for-linus-2' and 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Fix fallback sched_clock()'s offset when using jiffies

* 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  lockdep: increase MAX_LOCKDEP_ENTRIES and MAX_LOCKDEP_CHAINS
2009-05-18 10:11:06 -07:00
Linus Torvalds
0130b2d701 Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing: Append prompt in /debug/tracing/README file
  x86/function-graph: fix constraint for recording old return value
2009-05-18 09:15:41 -07:00
Ming Lei
24ed0c4bfc tracing: fix check for return value of register_module_notifier
return zero should be correct, so fix it.

[ Impact: eliminate incorrect syslog message ]

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: rostedt@goodmis.org
LKML-Reference: <1242545498-7285-1-git-send-email-tom.leiming@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-18 10:24:13 +02:00
Ingo Molnar
1079cac0f4 Merge commit 'v2.6.30-rc6' into tracing/core
Merge reason: we were on an -rc4 base, sync up to -rc6

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-18 10:15:35 +02:00
Ingo Molnar
dc3f81b129 Merge commit 'v2.6.30-rc6' into perfcounters/core
Merge reason: this branch was on an -rc4 base, merge it up to -rc6
              to get the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-18 07:37:49 +02:00
Linus Torvalds
86460103c4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM: check sysdev_suspend(PMSG_FREEZE) return value
2009-05-17 11:46:22 -07:00
Ingo Molnar
0203026b58 perf_counter: fix threaded task exit
Flushing counters in __exit_signal() with irqs disabled is not
a good idea as perf_counter_exit_task() acquires mutexes. So
flush it before acquiring the tasklist lock.

(Note, we still need a fix for when the PID has been unhashed.)

[ Impact: fix crash with inherited counters ]

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-17 11:26:57 +02:00
Peter Zijlstra
856d56b9e5 perf_counter: Fix counter inheritance
Srivatsa Vaddagiri reported that a Java workload triggers this
warning in kernel/exit.c:

   WARN_ON_ONCE(!list_empty(&tsk->perf_counter_ctx.counter_list));

Add the inherited counter propagation on self-detach, this could
cause counter leaks and incomplete stats in threaded code like
the below:

  #include <pthread.h>
  #include <unistd.h>

  void *thread(void *arg)
  {
          sleep(5);
          return NULL;
  }

  void main(void)
  {
          pthread_t thr;
          pthread_create(&thr, NULL, thread, NULL);
  }

Reported-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-17 07:52:24 +02:00
Peter Zijlstra
8bc2095951 perf_counter: Fix inheritance cleanup code
Clean up code that open-coded the list_{add,del}_counter() code in
__perf_counter_exit_task() which consequently diverged. This could
lead to software counter crashes.

Also, fold the ctx->nr_counter inc/dec into those functions and clean
up some of the related code.

[ Impact: fix potential sw counter crash, cleanup ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-17 07:52:23 +02:00
Linus Torvalds
0f6f49a8cd Fix caller information for warn_slowpath_null
Ian Campbell noticed that since "Eliminate thousands of warnings with
gcc 3.2 build" (commit 57adc4d2db) all
WARN_ON()'s currently appear to come from warn_slowpath_null(), eg:

  WARNING: at kernel/softirq.c:143 warn_slowpath_null+0x1c/0x20()

because now that warn_slowpath_null() is in the call path, the
__builtin_return_address(0) returns that, rather than the place that
caused the warning.

Fix this by splitting up the warn_slowpath_null/fmt cases differently,
using a common helper function, and getting the return address in the
right place.  This also happens to avoid the unnecessary stack usage for
the non-stdargs case, and just generally cleans things up.

Make the function name printout use %pS while at it.

Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-16 13:41:28 -07:00
Bjorn Helgaas
4484079d51 PM: check sysdev_suspend(PMSG_FREEZE) return value
Check the return value of sysdev_suspend().  I think this was a typo.
Without this change, the following "if" check is always false.
I also changed the error message so it's distinguishable from the
similar message a few lines above.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-05-15 23:30:50 +02:00
GeunSik Lim
88fc86c283 tracing: Append prompt in /debug/tracing/README file
append prompt in /debug/tracing/README file.

This is trivial issue. Fix typo Mini Howto file(README) for ftrace.

[ Impact: cleanup ]

Signed-off-by: GeunSik Lim <geunsik.lim@samsung.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: williams <williams@redhat.com>
LKML-Reference: <1242289418.31161.45.camel@centos51>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-15 19:43:22 +02:00
Linus Torvalds
ade385e4d1 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
  kgdb: gdb documentation fix
  kgdb,i386: use address that SP register points to in the exception frame
  sysrq, intel_fb: fix sysrq g collision
2009-05-15 08:06:45 -07:00
Paul Mackerras
9d23a90a67 perf_counter: allow arch to supply event misc flags and instruction pointer
At present the values we put in overflow events for the misc
flags indicating processor mode and the instruction pointer are
obtained using the standard user_mode() and
instruction_pointer() functions. Those functions tell you where
the performance monitor interrupt was taken, which might not be
exactly where the counter overflow occurred, for example
because interrupts were disabled at the point where the
overflow occurred, or because the processor had many
instructions in flight and chose to complete some more
instructions beyond the one that caused the counter overflow.

Some architectures (e.g. powerpc) can supply more precise
information about where the counter overflow occurred and the
processor mode at that point.  This introduces new functions,
perf_misc_flags() and perf_instruction_pointer(), which arch
code can override to provide more precise information if
available.  They have default implementations which are
identical to the existing code.

This also adds a new misc flag value,
PERF_EVENT_MISC_HYPERVISOR, for the case where a counter
overflow occurred in the hypervisor.  We encode the processor
mode in the 2 bits previously used to indicate user or kernel
mode; the values for user and kernel mode are unchanged and
hypervisor mode is indicated by both bits being set.

[ Impact: generalize perfcounter core facilities ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <18956.1272.818511.561835@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-15 16:38:56 +02:00
Peter Zijlstra
2e569d3672 perf_counter: frequency based adaptive irq_period, 32-bit fix
fix:

  kernel/built-in.o: In function `perf_counter_alloc':
  perf_counter.c:(.text+0x7ddc7): undefined reference to `__udivdi3'

[ Impact: build fix on 32-bit systems ]

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <1242394667.6642.1887.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-15 15:40:25 +02:00
Thomas Gleixner
2d02494f5a sched, timers: cleanup avenrun users
avenrun is an rough estimate so we don't have to worry about
consistency of the three avenrun values. Remove the xtime lock
dependency and provide a function to scale the values. Cleanup the
users.

[ Impact: cleanup ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2009-05-15 15:32:45 +02:00
Thomas Gleixner
dce48a84ad sched, timers: move calc_load() to scheduler
Dimitri Sivanich noticed that xtime_lock is held write locked across
calc_load() which iterates over all online CPUs. That can cause long
latencies for xtime_lock readers on large SMP systems. 

The load average calculation is an rough estimate anyway so there is
no real need to protect the readers vs. the update. It's not a problem
when the avenrun array is updated while a reader copies the values.

Instead of iterating over all online CPUs let the scheduler_tick code
update the number of active tasks shortly before the avenrun update
happens. The avenrun update itself is handled by the CPU which calls
do_timer().

[ Impact: reduce xtime_lock write locked section ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2009-05-15 15:32:45 +02:00
Peter Zijlstra
60db5e09c1 perf_counter: frequency based adaptive irq_period
Instead of specifying the irq_period for a counter, provide a target interrupt
frequency and dynamically adapt the irq_period to match this frequency.

[ Impact: new perf-counter attribute/feature ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <20090515132018.646195868@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-15 15:26:56 +02:00
Peter Zijlstra
789f90fcf6 perf_counter: per user mlock gift
Instead of a per-process mlock gift for perf-counters, use a
per-user gift so that there is less of a DoS potential.

[ Impact: allow less worst-case unprivileged memory consumption ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <20090515132018.496182835@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-15 15:26:56 +02:00
Peter Zijlstra
548e1ddf25 perf_counter: remove perf_disable/enable exports
Now that ACPI idle doesn't use it anymore, remove the exports.

[ Impact: remove dead code/data ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <20090515132018.429826617@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-15 15:26:55 +02:00
Thomas Gleixner
f1a11e0576 futex: remove the wait queue
The waitqueue which is used in struct futex_q is a leftover from the
futexfd implementation. There is no need to use a waitqueue at all, as
the waiting task is the only user of it. The waitqueue just adds
additional locking and a loop in the wake up path which both can be
avoided.

We have already a task reference in struct futex_q which is used for
PI futexes. Use it for normal futexes as well and just wake up the
task directly.

The logic of signalling the futex wakeup via setting q->lock_ptr to
NULL is kept with the difference that we set it NULL before doing the
wakeup. This opens an exit race window vs. a non futex wake up of the
to be woken up task, which we prevent with get_task_struct /
put_task_struct on the waiter.

[ Impact: simplification ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-15 15:24:18 +02:00
Jason Wessel
364b5b7b1d sysrq, intel_fb: fix sysrq g collision
Commit 79e539453b introduced a
regression where you cannot use sysrq 'g' to enter kgdb.  The solution
is to move the intel fb sysrq over to V for video instead of G for
graphics.  The SMP VOYAGER code to register for the sysrq-v is not
anywhere to be found in the mainline kernel, so the comments in the
code were cleaned up as well.

This patch also cleans up the sysrq definitions for kgdb to make it
generic for the kernel debugger, such that the sysrq 'g' can be used
in the future to enter a gdbstub or another kernel debugger.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2009-05-15 07:56:24 -05:00
Jens Axboe
cd17cbfda0 Revert "mm: add /proc controls for pdflush threads"
This reverts commit fafd688e4c.

Work is progressing to switch away from pdflush as the process backing
for flushing out dirty data. So it seems pointless to add more knobs
to control pdflush threads. The original author of the patch did not
have any specific use cases for adding the knobs, so we can easily
revert this before 2.6.30 to avoid having to maintain this API
forever.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-15 11:32:24 +02:00
Peter Zijlstra
9e35ad388b perf_counter: Rework the perf counter disable/enable
The current disable/enable mechanism is:

	token = hw_perf_save_disable();
	...
	/* do bits */
	...
	hw_perf_restore(token);

This works well, provided that the use nests properly. Except we don't.

x86 NMI/INT throttling has non-nested use of this, breaking things. Therefore
provide a reference counter disable/enable interface, where the first disable
disables the hardware, and the last enable enables the hardware again.

[ Impact: refactor, simplify the PMU disable/enable logic ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-15 09:47:02 +02:00
Peter Zijlstra
53020fe81e perf_counter: Fix perf_output_copy() WARN to account for overflow
The simple reservation test in perf_output_copy() failed to take
unsigned int overflow into account, fix this.

[ Impact: fix false positive warning with more than 4GB of profiling data ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-15 09:46:59 +02:00
Li Zefan
5872144f64 tracing/filters: fix off-by-one bug
We should leave the last slot for the ending '\0'.

[ Impact: fix possible crash when the length of an operand is 128 ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A0CDC8C.30602@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-14 23:55:12 -04:00
Li Zefan
8cd995b6de tracing/filters: add missing unlock in a failure path
[ Impact: fix deadlock in a rare case we fail to allocate memory ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A0CDC6F.7070200@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-14 23:55:10 -04:00
Steven Rostedt
1ec7c4849c tracing: stop stack trace on first empty entry
The stack tracer stores eight entries in the ring buffer when an event
traces the stack. The output outputs all eight entries regardless of
how many entries were recorded.

This patch breaks out of the loop when a null entry is discovered.

[ Impact: only print the stack that is recorded ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-14 23:40:06 -04:00
Arun R Bharadwaj
eea08f32ad timers: Logic to move non pinned timers
* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:

This patch migrates all non pinned timers and hrtimers to the current
idle load balancer, from all the idle CPUs. Timers firing on busy CPUs
are not migrated.

While migrating hrtimers, care should be taken to check if migrating
a hrtimer would result in a latency or not. So we compare the expiry of the
hrtimer with the next timer interrupt on the target cpu and migrate the
hrtimer only if it expires *after* the next interrupt on the target cpu.
So, added a clockevents_get_next_event() helper function to return the
next_event on the target cpu's clock_event_device.

[ tglx: cleanups and simplifications ]

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-13 16:52:42 +02:00
Arun R Bharadwaj
cd1bb94b4a timers: /proc/sys sysctl hook to enable timer migration
* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:

This patch creates the /proc/sys sysctl interface at
/proc/sys/kernel/timer_migration

Timer migration is enabled by default.

To disable timer migration, when CONFIG_SCHED_DEBUG = y,

echo 0 > /proc/sys/kernel/timer_migration

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-13 16:52:42 +02:00
Arun R Bharadwaj
5c333864a6 timers: Identifying the existing pinned timers
* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:

The following pinned hrtimers have been identified and marked:
1)sched_rt_period_timer
2)tick_sched_timer
3)stack_trace_timer_fn

[ tglx: fixup the hrtimer pinned mode ]

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-13 16:52:42 +02:00
Arun R Bharadwaj
597d027573 timers: Framework for identifying pinned timers
* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:

This patch creates a new framework for identifying cpu-pinned timers
and hrtimers.

This framework is needed because pinned timers are expected to fire on
the same CPU on which they are queued. So it is essential to identify
these and not migrate them, in case there are any.

For regular timers, the currently existing add_timer_on() can be used
queue pinned timers and subsequently mod_timer_pinned() can be used
to modify the 'expires' field.

For hrtimers, new modes HRTIMER_ABS_PINNED and HRTIMER_REL_PINNED are
added to queue cpu-pinned hrtimer.

[ tglx: use .._PINNED mode argument instead of creating tons of new
functions ]

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-13 16:52:42 +02:00
Steven Rostedt
39a2eddb9b genirq: fix comment to say IRQ_WAKE_THREAD
Trying to implement a driver to use threaded irqs, I was confused when the 
return value to use that was described in the comment above 
request_threaded_irq was not defined.

Turns out that the enum is IRQ_WAKE_THREAD where as the comment said 
IRQ_THREAD_WAKE.

[Impact: do not confuse developers with wrong comments ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <alpine.DEB.2.00.0905121431020.13338@gandalf.stny.rr.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-12 22:11:29 +02:00
Ingo Molnar
d80c19df5f lockdep: increase MAX_LOCKDEP_ENTRIES and MAX_LOCKDEP_CHAINS
Now that lockdep coverage has increased it has become easier to
run out of entries:

[   21.401387] BUG: MAX_LOCKDEP_ENTRIES too low!
[   21.402007] turning off the locking correctness validator.
[   21.402007] Pid: 1555, comm: S99local Not tainted 2.6.30-rc5-tip #2
[   21.402007] Call Trace:
[   21.402007]  [<ffffffff81069789>] add_lock_to_list+0x53/0xba
[   21.402007]  [<ffffffff810eb615>] ? lookup_mnt+0x19/0x53
[   21.402007]  [<ffffffff8106be14>] check_prev_add+0x14b/0x1c7
[   21.402007]  [<ffffffff8106c304>] validate_chain+0x474/0x52a
[   21.402007]  [<ffffffff8106c6fc>] __lock_acquire+0x342/0x3c7
[   21.402007]  [<ffffffff8106c842>] lock_acquire+0xc1/0xe5
[   21.402007]  [<ffffffff810eb615>] ? lookup_mnt+0x19/0x53
[   21.402007]  [<ffffffff8153aedc>] _spin_lock+0x31/0x66

Double the size - as we've done in the past.

[ Impact: allow lockdep to cover more locks ]

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-12 19:59:52 +02:00
Paul Mackerras
e758a33d6f perf_counter: call hw_perf_save_disable/restore around group_sched_in
I noticed that when enabling a group via the PERF_COUNTER_IOC_ENABLE
ioctl on the group leader, the counters weren't enabled and counting
immediately on return from the ioctl, but did start counting a little
while later (presumably after a context switch).

The reason was that __perf_counter_enable calls group_sched_in which
calls hw_perf_group_sched_in, which on powerpc assumes that the caller
has called hw_perf_save_disable already.  Until commit 46d686c6
("perf_counter: put whole group on when enabling group leader") it was
true that all callers of group_sched_in had called
hw_perf_save_disable first, and the powerpc hw_perf_group_sched_in
relies on that (there isn't an x86 version).

This fixes the problem by putting calls to hw_perf_save_disable /
hw_perf_restore around the calls to group_sched_in and
counter_sched_in in __perf_counter_enable.  Having the calls to
hw_perf_save_disable/restore around the counter_sched_in call is
harmless and makes this call consistent with the other call sites
of counter_sched_in, which have all called hw_perf_save_disable first.

[ Impact: more precise counter group disable/enable functionality ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <18953.25733.53359.147452@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-12 15:31:06 +02:00
Ingo Molnar
6cda3eb62e Merge branch 'x86/apic' into irq/numa
Merge reason: both topics modify the APIC code but were able to do it in
              parallel so far. An upcoming patch generates a conflict so
              merge them to avoid the conflict.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-12 12:17:36 +02:00
Steven Rostedt
168b6b1d05 ring-buffer: move code around to remove some branches
This is a bit of micro-optimizations. But since the ring buffer is used
in tracing every function call, it is an extreme hot path. Every nanosecond
counts.

This change shows over 5% improvement in the ring-buffer-benchmark.

[ Impact: more efficient code ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-11 23:33:06 -04:00
Steven Rostedt
88eb012536 ring-buffer: use internal time stamp function
The ring_buffer_time_stamp that is exported adds a little more overhead
than is needed for using it internally. This patch adds an internal
timestamp function that can be inlined (a single line function)
and used internally for the ring buffer.

[ Impact: a little less overhead to the ring buffer ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-11 23:14:03 -04:00
Steven Rostedt
0f0c85fc80 ring-buffer: small optimizations
Doing some small changes in the fast path of the ring buffer recording
saves over 3% in the ring-buffer-benchmark test.

[ Impact: a little faster ring buffer recording ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-11 23:12:34 -04:00
H. Peter Anvin
5031296c57 x86: add extension fields for bootloader type and version
A long ago, in days of yore, it all began with a god named Thor.
There were vikings and boats and some plans for a Linux kernel
header.  Unfortunately, a single 8-bit field was used for bootloader
type and version.  This has generally worked without *too* much pain,
but we're getting close to flat running out of ID fields.

Add extension fields for both type and version.  The type will be
extended if it the old field is 0xE; the version is a simple MSB
extension.

Keep /proc/sys/kernel/bootloader_type containing
(type << 4) + (ver & 0xf) for backwards compatiblity, but also add
/proc/sys/kernel/bootloader_version which contains the full version
number.

[ Impact: new feature to support more bootloaders ]

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-05-11 17:45:06 -07:00
Steven Rostedt
be957c447f ring-buffer: move calculation of event length
The event length is calculated and passed in to rb_reserve_next_event
in two different locations. Having rb_reserve_next_event do the
calculations directly makes only one location to do the change and
causes the calculation to be inlined by gcc.

Before:
   text    data     bss     dec     hex filename
  16538      24      12   16574    40be kernel/trace/ring_buffer.o

After:
   text    data     bss     dec     hex filename
  16490      24      12   16526    408e kernel/trace/ring_buffer.o

[ Impact: smaller more efficient code ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-11 14:42:53 -04:00
Steven Rostedt
1cd8d73589 ring-buffer: remove type parameter from rb_reserve_next_event
The rb_reserve_next_event is only called for the data type (type = 0).
There is no reason to pass in the type to the function.

Before:
   text    data     bss     dec     hex filename
  16554      24      12   16590    40ce kernel/trace/ring_buffer.o

After:
   text    data     bss     dec     hex filename
  16538      24      12   16574    40be kernel/trace/ring_buffer.o

[ Impact: cleaner, smaller and slightly more efficient code ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-11 14:19:00 -04:00
Steven Rostedt
d988ff94c1 ring-buffer: check for divide by zero in ring-buffer-benchmark
Although we check if "missed" is not zero, we divide by hit + missed,
and the addition can possible overflow and become a divide by zero.

This patch checks for this case, and will report it when it happens
then modify "hit" to make the calculation be non zero.

[ Impact: prevent possible divide by zero in ring-buffer-benchmark ]

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-11 13:22:26 -04:00
Steven Rostedt
5a772b2b3c ring-buffer: replace constants with time macros in ring-buffer-benchmark
The use of numeric constants is discouraged. It is cleaner and more
descriptive to use macros for constant time conversions.

This patch also removes an extra new line.

[ Impact: more descriptive time conversions ]

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-11 13:22:26 -04:00
Ingo Molnar
7961386fe9 Merge commit 'v2.6.30-rc5' into sched/core
Merge reason: sched/core was on .30-rc1 before, update to latest fixes

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-11 12:59:37 +02:00
Li Zefan
0498625793 blktrace: pdu_buf of pc events should be unsigned
I got this:
  8,0    1   305.417782332  2037  I   R 32 (ffffff9e 10 00 ...) [bash]

It should be:
  8,0    1   305.417782332  2037  I   R 32 (9e 10 00 ...) [bash]

[ Impact: fix output of pc events ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A07C6B3.9080802@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-11 12:25:50 +02:00
Paul Mackerras
615a3f1e05 perf_counter: call atomic64_set for counter->count
A compile warning triggered because we are calling
atomic_set(&counter->count). But since counter->count
is an atomic64_t, we have to use atomic64_set.

So the count can be set short, resulting in the reset ioctl
only resetting the low word.

[ Impact: clear counter properly during the reset ioctl ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <18951.48285.270311.981806@drongo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-11 12:10:54 +02:00
Paul Mackerras
a08b159fc2 perf_counter: don't count scheduler ticks as context switches
The context-switch software counter gives inflated values at present
because each scheduler tick and each process-wide counter
enable/disable prctl gets counted as a context switch.

This happens because perf_counter_task_tick, perf_counter_task_disable
and perf_counter_task_enable all call perf_counter_task_sched_out,
which calls perf_swcounter_event to record a context switch event.

This fixes it by introducing a variant of perf_counter_task_sched_out
with two underscores in front for internal use within the perf_counter
code, and makes perf_counter_task_{tick,disable,enable} call it.  This
variant doesn't record a context switch event, and takes a struct
perf_counter_context *.  This adds the new variant rather than
changing the behaviour or interface of perf_counter_task_sched_out
because that is called from other code.

[ Impact: fix inflated context-switch event counts ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <18951.48034.485580.498953@drongo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-11 12:10:53 +02:00
Paul Mackerras
6751b71ea2 perf_counter: Put whole group on when enabling group leader
Currently, if you have a group where the leader is disabled and there
are siblings that are enabled, and then you enable the leader, we only
put the leader on the PMU, and not its enabled siblings.  This is
incorrect, since the enabled group members should be all on or all off
at any given point.

This fixes it by adding a call to group_sched_in in
__perf_counter_enable in the case where we're enabling a group leader.

To avoid the need for a forward declaration this also moves
group_sched_in up before __perf_counter_enable.  The actual content of
group_sched_in is unchanged by this patch.

[ Impact: fix bug in counter enable code ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <18951.34946.451546.691693@drongo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-11 12:10:53 +02:00
Tejun Heo
2e46e8b27a block: drop request->hard_* and *nr_sectors
struct request has had a few different ways to represent some
properties of a request.  ->hard_* represent block layer's view of the
request progress (completion cursor) and the ones without the prefix
are supposed to represent the issue cursor and allowed to be updated
as necessary by the low level drivers.  The thing is that as block
layer supports partial completion, the two cursors really aren't
necessary and only cause confusion.  In addition, manual management of
request detail from low level drivers is cumbersome and error-prone at
the very least.

Another interesting duplicate fields are rq->[hard_]nr_sectors and
rq->{hard_cur|current}_nr_sectors against rq->data_len and
rq->bio->bi_size.  This is more convoluted than the hard_ case.

rq->[hard_]nr_sectors are initialized for requests with bio but
blk_rq_bytes() uses it only for !pc requests.  rq->data_len is
initialized for all request but blk_rq_bytes() uses it only for pc
requests.  This causes good amount of confusion throughout block layer
and its drivers and determining the request length has been a bit of
black magic which may or may not work depending on circumstances and
what the specific LLD is actually doing.

rq->{hard_cur|current}_nr_sectors represent the number of sectors in
the contiguous data area at the front.  This is mainly used by drivers
which transfers data by walking request segment-by-segment.  This
value always equals rq->bio->bi_size >> 9.  However, data length for
pc requests may not be multiple of 512 bytes and using this field
becomes a bit confusing.

In general, having multiple fields to represent the same property
leads only to confusion and subtle bugs.  With recent block low level
driver cleanups, no driver is accessing or manipulating these
duplicate fields directly.  Drop all the duplicates.  Now rq->sector
means the current sector, rq->data_len the current total length and
rq->bio->bi_size the current segment length.  Everything else is
defined in terms of these three and available only through accessors.

* blk_recalc_rq_sectors() is collapsed into blk_update_request() and
  now handles pc and fs requests equally other than rq->sector update.
  This means that now pc requests can use partial completion too (no
  in-kernel user yet tho).

* bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
  now uses byte count as the primary data length.

* blk_rq_pos() is now guranteed to be always correct.  In-block users
  converted.

* blk_rq_bytes() is now guaranteed to be always valid as is
  blk_rq_sectors().  In-block users converted.

* blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
  More convenient one is used.

* blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
  pointer to request.

[ Impact: API cleanup, single way to represent one property of a request ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 09:50:54 +02:00
Tejun Heo
5b93629b45 block: implement blk_rq_pos/[cur_]sectors() and convert obvious ones
Implement accessors - blk_rq_pos(), blk_rq_sectors() and
blk_rq_cur_sectors() which return rq->hard_sector, rq->hard_nr_sectors
and rq->hard_cur_sectors respectively and convert direct references of
the said fields to the accessors.

This is in preparation of request data length handling cleanup.

Geert	: suggested adding const to struct request * parameter to accessors
Sergei	: spotted error in patch description

[ Impact: cleanup ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Ackec-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 09:50:53 +02:00
David Howells
5e751e992f CRED: Rename cred_exec_mutex to reflect that it's a guard against ptrace
Rename cred_exec_mutex to reflect that it's a guard against foreign
intervention on a process's credential state, such as is made by ptrace().  The
attachment of a debugger to a process affects execve()'s calculation of the new
credential state - _and_ also setprocattr()'s calculation of that state.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-11 08:15:36 +10:00
Al Viro
6f5bbff9a1 Convert obvious places to deactivate_locked_super()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:40 -04:00
Ron
92d23f703c sched: Fix fallback sched_clock()'s offset when using jiffies
Account for the initial offset to the jiffy count.

[ Impact: fix printk timestamps on architectures using fallback sched_clock() ]

Signed-off-by: Ron Lee <ron@debian.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-09 10:08:19 +02:00
Masami Hiramatsu
201517a7f3 kprobes: fix to use text_mutex around arm/disarm kprobe
Fix kprobes to lock text_mutex around some arch_arm/disarm_kprobe() which
are newly added by commit de5bd88d5a.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-08 16:23:48 -07:00
Steven Rostedt
4671c79408 tracing: add trace_set_clr_event to export event enabling function
Other parts of the kernel may need to be able to enable or disable
specific events. Especially parts that create trace events.

[ Impact: allow enabling of trace events by those that create the event ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-08 16:30:26 -04:00
Steven Rostedt
29f93943d1 tracing: initialize return value for __ftrace_set_clr_event
Commit 8f31bfe538
tracing/events: clean up for ftrace_set_clr_event()

Moved out the code for ftrace_set_clr_event into a helper funciton but
did not initialize the return value. As a result, we do not warn about
a typo in the echoing of events in set_event.

This patch restores the old warning:

 # echo foobar > set_event
-bash: echo: write error: Invalid argument

[ Impact: restore warning of invalid entries to set_event ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-08 16:06:47 -04:00
Peter Zijlstra
f370e1e2f1 perf_counter: add PERF_RECORD_CPU
Allow recording the CPU number the event was generated on.

RFC: this leaves a u32 as reserved, should we fill in the
     node_id() there, or leave this open for future extention,
     as userspace can already easily do the cpu->node mapping
     if needed.

[ Impact: extend perfcounter output record format ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090508170029.008627711@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-08 20:36:59 +02:00
Peter Zijlstra
a85f61abe1 perf_counter: add PERF_RECORD_CONFIG
Much like CONFIG_RECORD_GROUP records the hw_event.config to
identify the values, allow to record this for all counters.

[ Impact: extend perfcounter output record format ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090508170028.923228280@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-08 20:36:58 +02:00
Peter Zijlstra
3df5edad87 perf_counter: rework ioctl()s
Corey noticed that ioctl()s on grouped counters didn't work on
the whole group. This extends the ioctl() interface to take a
second argument that is interpreted as a flags field. We then
provide PERF_IOC_FLAG_GROUP to toggle the behaviour.

Having this flag gives the greatest flexibility, allowing you
to individually enable/disable/reset counters in a group, or
all together.

[ Impact: fix group counter enable/disable semantics ]

Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090508170028.837558214@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-08 20:36:58 +02:00
Peter Zijlstra
7fc23a5380 perf_counter: optimize perf_counter_task_tick()
perf_counter_task_tick() does way too much work to find out
there's nothing to do. Provide an easy short-circuit for the
normal case where there are no counters on the system.

[ Impact: micro-optimization ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090508170028.750619201@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-08 20:36:57 +02:00
Li Zefan
c142b15dc5 tracing/events: simplify system_enable_read()
A smarter way to figure out the output of an enable file.

[ Impact: clean up ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A0399A5.2080603@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-08 14:00:36 +02:00
Li Zefan
8f31bfe538 tracing/events: clean up for ftrace_set_clr_event()
Add a helper function __ftrace_set_clr_event(), and replace some
ftrace_set_clr_event() calls with this helper, thus we don't need any
kstrdup() or kmalloc().

As a side effect, this patch fixes an issue in self tests code, which is
similar to the one fixed in commit d6bf81ef0f
("tracing: append ":*" to internal setting of system events")

It's a small issue and won't cause any bug in fact, but we should do things
right anyway.

[ Impact: prevent spurious event-enabling in tracing self-tests ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A03998E.3020503@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-08 14:00:35 +02:00
Ingo Molnar
f066a15533 Merge branch 'x86/urgent' into x86/xen
Conflicts:
	arch/frv/include/asm/pgtable.h
	arch/x86/include/asm/required-features.h
	arch/x86/xen/mmu.c

Merge reason: x86/xen was on a .29 base still, move it to a fresher
              branch and pick up Xen fixes as well, plus resolve
              conflicts

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-08 10:50:00 +02:00
James Morris
d254117099 Merge branch 'master' into next 2009-05-08 17:56:47 +10:00
Steven Rostedt
74f4fd2166 ring-buffer: change WARN_ON from checking preempt_count to preemptible
There's a WARN_ON in the ring buffer code that makes sure preemption
is disabled. It checks "!preempt_count()". But when CONFIG_PREEMPT is not
enabled, preempt_count() is always zero, and this will trigger the warning.

[ Impact: prevent false warning on non preemptible kernels ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-07 20:01:11 -04:00
Steven Rostedt
7da3046d6c ring-buffer: add total count in ring-buffer-benchmark
It is nice to see the overhead of the benchmark test when tracing is
disabled. That is, we turn off the ring buffer just to see what the
cost of running the loop that calls into the ring buffer is.

Currently, if no entries wer made, we get 0. This is not informative.
This patch changes it to check if we had any "missed" (non recorded)
events. If so, a total count is also reported.

[ Impact: evaluate the over head of the ring buffer benchmark test ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-07 19:52:20 -04:00
Steven Rostedt
0574ea421b ring-buffer: only periodically call cond_resched to ring-buffer-benchmark
Calling cond_resched at every iteration of the loop adds a bit of
overhead to the benchmark.

This patch does two things.

1) only calls cond-resched when CONFIG_PREEMPT is not enabled
2) only calls cond-resched after so many traces has been performed.

[ Impact: less overhead to the ring-buffer-benchmark ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-07 14:20:28 -04:00
Steven Rostedt
65b7724204 tracing: have menu default enabled when kernel debug is configured
Tracing can be very helpful to debug the kernel. When DEBUG_KERNEL is
enabled it is nice to enable the trace menu as well.

This patch only make the tracing menu enabled by default, it does not
make any of the tracers enabled. And the menu is only enabled by
default if DEBUG_KERNEL is enabled.

[ Impact: show tracing options to those debugging the kernel ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-07 12:49:27 -04:00
Steven Rostedt
d6bf81ef0f tracing: append ":*" to internal setting of system events
The system enabling of events uses the same code as the set_event file.
It passes in the name of the system to the parser and that will enable
all the events that has that system as a name.

The problem is that it will also enable events with the same name as the
system.

If you have system name foo, and system name bar, but within the system
bar, there exists an event called foo. By setting the system name foo,
you will also be enabling the event foo in the system bar. This is not
an expected result.

The solution is to pass in "foo:*", which will only enable the system
foo and not events called foo.

[ Impact: prevent accidental enabling of events with same name as a system ]

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-07 11:49:35 -04:00
Steven Rostedt
29c8000ee7 ring-buffer: remove complex calculations in ring-buffer-test
Ingo Molnar thought that the code to calculate the time in cond_resched
is a bit too ugly and is not needed. This patch removes it and replaces
it with a simple call to cond_resched. I kept the comment that explains
the reason for the cond_resched.

[ Impact: remove ugly code ]

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-07 11:16:18 -04:00
Ingo Molnar
0ad5d703c6 Merge branch 'tracing/hw-branch-tracing' into tracing/core
Merge reason: this topic is ready for upstream now. It passed
              Oleg's review and Andrew had no further mm/*
              objections/observations either.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-07 13:36:22 +02:00
Ingo Molnar
44347d947f Merge branch 'linus' into tracing/core
Merge reason: tracing/core was on a .30-rc1 base and was missing out on
              on a handful of tracing fixes present in .30-rc5-almost.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-07 11:17:34 +02:00
Li Zefan
d94fc523f3 tracing/events: fix concurrent access to ftrace_events list, fix
In filter_add_subsystem_pred() we should release event_mutex before
calling filter_free_subsystem_preds(), since both functions hold
event_mutex.

[ Impact: fix deadlock when writing invalid pred into subsystem filter ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: tzanussi@gmail.com
Cc: a.p.zijlstra@chello.nl
Cc: fweisbec@gmail.com
Cc: rostedt@goodmis.org
LKML-Reference: <4A028993.7020509@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-07 10:07:28 +02:00
Frederic Weisbecker
5928c3cc0f tracing/filters: support for operator reserved characters in strings
When we set a filter for an event, such as:

echo "name == my_lock_name" > \
	/debug/tracing/events/lockdep/lock_acquired/filter

then the following order of token type is parsed:

- space
- operator
- parentheses
- operand

Because the operators and parentheses have a higher precedence
than the operand characters, which is normal, then we can't
use any string containing such special characters:

()=<>!&|

To get this support and also avoid ambiguous intepretation from
the parser or the human, we can do it using double quotes so that
we keep the usual languages habits.

Then after this patch you can still declare string condition like
before:

echo name == myname

But if you want to compare against a string containing an operator
character, you can use double quotes:

echo 'name == "&myname"'

Don't forget to include the whole expression into single quotes or
the double ones will be eaten by echo.

[ Impact: support strings with special characters for tracing filters ]

Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Zhaolei <zhaolei@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-05-07 10:05:57 +02:00
Frederic Weisbecker
e8808c1019 tracing/filters: support for filters of dynamic sized arrays
Currently the filtering infrastructure supports well the
numeric types and fixed sized array types.

But the recently added __string() field uses a specific
indirect offset mechanism which requires a specific
predicate. Until now it wasn't supported.

This patch adds this support and implies very few changes,
only a new predicate is needed, the management of this specific
field can be done through the usual string helpers in the
filtering infrastructure.

[ Impact: support all kinds of strings in the tracing filters ]

Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Zhaolei <zhaolei@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-05-07 10:05:57 +02:00
David Rientjes
aa47b7e0f8 sched: emit thread info flags with stack trace
When a thread is oom killed and fails to exit, it's helpful to know which
threads have access to memory reserves if the machine livelocks.  This is
done by testing for the TIF_MEMDIE thread info flag and should be
displayed alongside stack traces to identify tasks that have access to
such reserves but are still stuck allocating pages, for instance.

It would probably be helpful in other cases as well, so all thread info
flags are emitted when showing a task.

( v2: fix warning reported by Stephen Rothwell )

[ Impact: extend debug printout info ]

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
LKML-Reference: <alpine.DEB.2.00.0905040136390.15831@chino.kir.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-07 09:36:28 +02:00
Steven Rostedt
8ae79a138e tracing: add hierarchical enabling of events
With the current event directory, you can only enable individual events.
The file debugfs/tracing/set_event is used to be able to enable or
disable several events at once. But that can still be awkward.

This patch adds hierarchical enabling of events. That is, each directory
in debugfs/tracing/events has an "enable" file. This file can enable
or disable all events within the directory and below.

 # echo 1 > /debugfs/tracing/events/enable

will enable all events.

 # echo 1 > /debugfs/tracing/events/sched/enable

will enable all events in the sched subsystem.

 # echo 1 > /debugfs/tracing/events/enable
 # echo 0 > /debugfs/tracing/events/irq/enable

will enable all events, but then disable just the irq subsystem events.

When reading one of these enable files, there are four results:

 0 - all events this file affects are disabled
 1 - all events this file affects are enabled
 X - there is a mixture of events enabled and disabled
 ? - this file does not affect any event

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-06 23:11:42 -04:00
Steven Rostedt
9456f0fa6d tracing: reset ring buffer when removing modules with events
Li Zefan found that there's a race using the event ids of events and
modules. When a module is loaded, an event id is incremented. We only
have 16 bits for event ids (65536) and there is a possible (but highly
unlikely) race that we could load and unload a module that registers
events so many times that the event id counter overflows.

When it overflows, it then restarts and goes looking for available
ids. An id is available if it was added by a module and released.

The race is if you have one module add an id, and then is removed.
Another module loaded can use that same event id. But if the old module
still had events in the ring buffer, the new module's call back would
get bogus data.  At best (and most likely) the output would just be
garbage. But if the module for some reason used pointers (not recommended)
then this could potentially crash.

The safest thing to do is just reset the ring buffer if a module that
registered events is removed.

[ Impact: prevent unpredictable results of event id overflows ]

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <49FEAFD0.30106@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-06 23:11:41 -04:00
Andi Kleen
57adc4d2db Eliminate thousands of warnings with gcc 3.2 build
When building with gcc 3.2 I get thousands of warnings such as

include/linux/gfp.h: In function `allocflags_to_migratetype':
include/linux/gfp.h:105: warning: null format string

due to passing a NULL format string to warn_slowpath() in

#define __WARN()		warn_slowpath(__FILE__, __LINE__, NULL)

Split this case out into a separate call.  This also shrinks the kernel
slightly:

          text    data     bss     dec     hex filename
       4802274  707668  712704 6222646  5ef336 vmlinux
          text    data     bss     dec     hex filename
       4799027  703572  712704 6215303  5ed687 vmlinux

due to removeing one argument from the commonly-called __WARN().

[akpm@linux-foundation.org: reduce scope of `empty']
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-06 16:36:09 -07:00
Wu Fengguang
381a80e6df inotify: use GFP_NOFS in kernel_event() to work around a lockdep false-positive
There is what we believe to be a false positive reported by lockdep.

inotify_inode_queue_event() => take inotify_mutex => kernel_event() =>
kmalloc() => SLOB => alloc_pages_node() => page reclaim => slab reclaim =>
dcache reclaim => inotify_inode_is_dead => take inotify_mutex => deadlock

The plan is to fix this via lockdep annotation, but that is proving to be
quite involved.

The patch flips the allocation over to GFP_NFS to shut the warning up, for
the 2.6.30 release.

Hopefully we will fix this for real in 2.6.31.  I'll queue a patch in -mm
to switch it back to GFP_KERNEL so we don't forget.

  =================================
  [ INFO: inconsistent lock state ]
  2.6.30-rc2-next-20090417 #203
  ---------------------------------
  inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
  kswapd0/380 [HC0[0]:SC0[0]:HE1:SE1] takes:
   (&inode->inotify_mutex){+.+.?.}, at: [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0
  {RECLAIM_FS-ON-W} state was registered at:
    [<ffffffff81079188>] mark_held_locks+0x68/0x90
    [<ffffffff810792a5>] lockdep_trace_alloc+0xf5/0x100
    [<ffffffff810f5261>] __kmalloc_node+0x31/0x1e0
    [<ffffffff81130652>] kernel_event+0xe2/0x190
    [<ffffffff81130826>] inotify_dev_queue_event+0x126/0x230
    [<ffffffff8112f096>] inotify_inode_queue_event+0xc6/0x110
    [<ffffffff8110444d>] vfs_create+0xcd/0x140
    [<ffffffff8110825d>] do_filp_open+0x88d/0xa20
    [<ffffffff810f6b68>] do_sys_open+0x98/0x140
    [<ffffffff810f6c50>] sys_open+0x20/0x30
    [<ffffffff8100c272>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff
  irq event stamp: 690455
  hardirqs last  enabled at (690455): [<ffffffff81564fe4>] _spin_unlock_irqrestore+0x44/0x80
  hardirqs last disabled at (690454): [<ffffffff81565372>] _spin_lock_irqsave+0x32/0xa0
  softirqs last  enabled at (690178): [<ffffffff81052282>] __do_softirq+0x202/0x220
  softirqs last disabled at (690157): [<ffffffff8100d50c>] call_softirq+0x1c/0x50

  other info that might help us debug this:
  2 locks held by kswapd0/380:
   #0:  (shrinker_rwsem){++++..}, at: [<ffffffff810d0bd7>] shrink_slab+0x37/0x180
   #1:  (&type->s_umount_key#17){++++..}, at: [<ffffffff8110cfbf>] shrink_dcache_memory+0x11f/0x1e0

  stack backtrace:
  Pid: 380, comm: kswapd0 Not tainted 2.6.30-rc2-next-20090417 #203
  Call Trace:
   [<ffffffff810789ef>] print_usage_bug+0x19f/0x200
   [<ffffffff81018bff>] ? save_stack_trace+0x2f/0x50
   [<ffffffff81078f0b>] mark_lock+0x4bb/0x6d0
   [<ffffffff810799e0>] ? check_usage_forwards+0x0/0xc0
   [<ffffffff8107b142>] __lock_acquire+0xc62/0x1ae0
   [<ffffffff810f478c>] ? slob_free+0x10c/0x370
   [<ffffffff8107c0a1>] lock_acquire+0xe1/0x120
   [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
   [<ffffffff81562d43>] mutex_lock_nested+0x63/0x420
   [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
   [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
   [<ffffffff81012fe9>] ? sched_clock+0x9/0x10
   [<ffffffff81077165>] ? lock_release_holdtime+0x35/0x1c0
   [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0
   [<ffffffff8110c9dc>] dentry_iput+0xbc/0xe0
   [<ffffffff8110cb23>] d_kill+0x33/0x60
   [<ffffffff8110ce23>] __shrink_dcache_sb+0x2d3/0x350
   [<ffffffff8110cffa>] shrink_dcache_memory+0x15a/0x1e0
   [<ffffffff810d0cc5>] shrink_slab+0x125/0x180
   [<ffffffff810d1540>] kswapd+0x560/0x7a0
   [<ffffffff810ce160>] ? isolate_pages_global+0x0/0x2c0
   [<ffffffff81065a30>] ? autoremove_wake_function+0x0/0x40
   [<ffffffff8107953d>] ? trace_hardirqs_on+0xd/0x10
   [<ffffffff810d0fe0>] ? kswapd+0x0/0x7a0
   [<ffffffff8106555b>] kthread+0x5b/0xa0
   [<ffffffff8100d40a>] child_rip+0xa/0x20
   [<ffffffff8100cdd0>] ? restore_args+0x0/0x30
   [<ffffffff81065500>] ? kthread+0x0/0xa0
   [<ffffffff8100d400>] ? child_rip+0x0/0x20

[eparis@redhat.com: fix audit too]
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-06 16:36:09 -07:00
Steven Rostedt
3e07a4f680 ring-buffer: change test to be more latency friendly
The ring buffer benchmark/test runs a producer for 10 seconds.
This is done with preemption and interrupts enabled. But if the kernel
is not compiled with CONFIG_PREEMPT, it basically stops everything
but interrupts for 10 seconds.

Although this is just a test and is not for production, this attribute
can be quite annoying. It can also spawn badness elsewhere.

This patch solves the issues by calling "cond_resched" when the system
is not compiled with CONFIG_PREEMPT. It also keeps track of the time
spent to call cond_resched such that it does not go against the
time calculations. That is, if the task schedules away, the time scheduled
out is removed from the test data. Note, this only works for non PREEMPT
because we do not know when the task is scheduled out if we have PREEMPT
enabled.

[ Impact: prevent test from stopping the world for 10 seconds ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-06 18:36:59 -04:00
Steven Rostedt
6634ff26cc ring-buffer: make moving the tail page a separate function
Ingo Molnar thought the code would be cleaner if we used a function call
instead of a goto for moving the tail page. After implementing this,
it seems that gcc still inlines the result and the output is pretty much
the same. Since this is considered a cleaner approach, might as well
implement it.

[ Impact: code clean up ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-06 15:30:07 -04:00
Steven Rostedt
00c81a58c5 ring-buffer: check for failed allocation in ring buffer benchmark
The result of the allocation of the ring buffer read page in the
ring buffer bench mark does not check the return to see if a page
was actually allocated. This patch fixes that.

[ Impact: avoid NULL dereference ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-06 12:49:20 -04:00
Steven Rostedt
8e7abf1c62 ring-buffer: remove unneeded conditional in rb_reserve_next
The code in __rb_reserve_next checks on page overflow if it is the
original commiter and then resets the page back to the original
setting.  Although this is fine, and the code is correct, it is
a bit fragil. Some experimental work I did breaks it easily.

The better and more robust solution is to have all commiters that
overflow the page, simply subtract what they added.

[ Impact: more robust ring buffer account management ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-06 12:49:19 -04:00
Jaswinder Singh Rajput
48dd0fed90 tracing: trace_output.c, fix false positive compiler warning
This compiler warning:

  CC      kernel/trace/trace_output.o
 kernel/trace/trace_output.c: In function ‘register_ftrace_event’:
 kernel/trace/trace_output.c:544: warning: ‘list’ may be used uninitialized in this function

Is wrong as 'list' is always initialized - but GCC (4.3.2) does not
recognize this relationship properly.

Work around the warning by initializing the variable to NULL.

[ Impact: fix false positive compiler warning ]

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-06 14:19:16 +02:00
Alan D. Brunelle
22a7c31a96 blktrace: from-sector redundant in trace_block_remap
Remove redundant from-sector parameter: it's /always/ the bio's sector
passed in.

[ Impact: cleanup ]

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <49FF517C.7000503@hp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-06 14:13:01 +02:00
Alan D. Brunelle
a42aaa3bbc blktrace: correct remap names
This attempts to clarify names utilized during block I/O remap
operations (partition, volume manager). It correctly matches up the
/from/ information for both device & sector. This takes in the concept
from Kosaki Motohiro and extends it to include better naming for the
"device_from" field.

[ Impact: cleanup ]

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <49FF4FAE.3000301@hp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-06 14:13:00 +02:00
Mathieu Desnoyers
de1d728606 tracepoint: trace_sched_migrate_task(): remove parameter
The orig_cpu parameter in trace_sched_migrate_task() is not necessary,
it can be got by using task_cpu(p) in the probe.

[ Impact: micro-optimization ]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
[ modified from Mathieu's patch. The original patch is at:
  http://marc.info/?l=linux-kernel&m=123791201716239&w=2 ]
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: fweisbec@gmail.com
Cc: rostedt@goodmis.org
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: zhaolei@cn.fujitsu.com
Cc: laijs@cn.fujitsu.com
LKML-Reference: <49FFFDB7.1050402@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-06 12:15:51 +02:00
Li Zefan
20c8928abe tracing/events: fix concurrent access to ftrace_events list
A module will add/remove its trace events when it gets loaded/unloaded, so
the ftrace_events list is not "const", and concurrent access needs to be
protected.

This patch thus fixes races between loading/unloding modules and read
'available_events' or read/write 'set_event', etc.

Below shows how to reproduce the race:

 # for ((; ;)) { cat /mnt/tracing/available_events; } > /dev/null &
 # for ((; ;)) { insmod trace-events-sample.ko; rmmod sample; } &

After a while:

BUG: unable to handle kernel paging request at 0010011c
IP: [<c1080f27>] t_next+0x1b/0x2d
...
Call Trace:
 [<c10c90e6>] ? seq_read+0x217/0x30d
 [<c10c8ecf>] ? seq_read+0x0/0x30d
 [<c10b4c19>] ? vfs_read+0x8f/0x136
 [<c10b4fc3>] ? sys_read+0x40/0x65
 [<c1002a68>] ? sysenter_do_call+0x12/0x36

[ Impact: fix races when concurrent accessing ftrace_events list ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4A00F709.3080800@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-06 10:38:19 +02:00
Li Zefan
2df75e4157 tracing/events: fix memory leak when unloading module
When unloading a module, memory allocated by init_preds() and
trace_define_field() is not freed.

[ Impact: fix memory leak ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <4A00F6E0.3040503@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-06 10:38:19 +02:00
Ingo Molnar
3611dfb8ed Merge branch 'core/locking' into perfcounters/core
Merge reason: we moved a mutex.h commit that originated from the
              perfcounters tree into core/locking - but now merge
	      back that branch to solve a merge artifact and to
	      pick up cleanups of this commit that happened in
	      core/locking.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-06 08:47:26 +02:00
Steven Rostedt
5092dbc96f ring-buffer: add benchmark and tester
This patch adds code that can benchmark the ring buffer as well as
test it. This code can be compiled into the kernel (not recommended)
or as a module.

A separate ring buffer is used to not interfer with other users, like
ftrace. It creates a producer and a consumer (option to disable creation
of the consumer) and will run for 10 seconds, then sleep for 10 seconds
and then repeat.

While running, the producer will write 10 byte loads into the ring
buffer with just putting in the current CPU number. The reader will
continually try to read the buffer. The reader will alternate from reading
the buffer via event by event, or by full pages.

The output is a pr_info, thus it will fill up the syslogs.

  Starting ring buffer hammer
  End ring buffer hammer
  Time:     9000349 (usecs)
  Overruns: 12578640
  Read:     5358440  (by events)
  Entries:  0
  Total:    17937080
  Missed:   0
  Hit:      17937080
  Entries per millisec: 1993
  501 ns per entry
  Sleeping for 10 secs
  Starting ring buffer hammer
  End ring buffer hammer
  Time:     9936350 (usecs)
  Overruns: 0
  Read:     28146644  (by pages)
  Entries:  74
  Total:    28146718
  Missed:   0
  Hit:      28146718
  Entries per millisec: 2832
  353 ns per entry
  Sleeping for 10 secs

Time:      is the time the test ran
Overruns:  the number of events that were overwritten and not read
Read:      the number of events read (either by pages or events)
Entries:   the number of entries left in the buffer
                 (the by pages will only read full pages)
Total:     Entries + Read + Overruns
Missed:    the number of entries that failed to write
Hit:       the number of entries that were written

The above example shows that it takes ~353 nanosecs per entry when
there is a reader, reading by pages (and no overruns)

The event by event reader slowed the producer down to 501 nanosecs.

[ Impact: see how changes to the ring buffer affect stability and performance ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-06 00:08:50 -04:00
Steven Rostedt
aa20ae8444 ring-buffer: move big if statement down
In the hot path of the ring buffer "__rb_reserve_next" there's a big
if statement that does not even return back to the work flow.

	code;

	if (cross to next page) {

		[ lots of code ]

		return;
	}

	more code;

The condition is even the unlikely path, although we do not denote it
with an unlikely because gcc is fine with it. The condition is true when
the write crosses a page boundary, and we need to start at a new page.

Having this if statement makes it hard to read, but calling another
function to do the work is also not appropriate, because we are using a lot
of variables that were set before the if statement, and we do not want to
send them as parameters.

This patch changes it to a goto:

	code;

	if (cross to next page)
		goto next_page;

	more code;

	return;

next_page:

	[ lots of code]

This makes the code easier to understand, and a bit more obvious.

The output from gcc is practically identical. For some reason, gcc decided
to use different registers when I switched it to a goto. But other than that,
the logic is the same.

[ Impact: easier to read code ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-05 21:16:11 -04:00
Steven Rostedt
94487d6d53 tracing: use proper export symbol for tracing api
When adding the EXPORT_SYMBOL to some of the tracing API, I accidently
used EXPORT_SYMBOL instead of EXPORT_SYMBOL_GPL. This patch fixes
that mistake.

[ Impact: export the tracing code only for GPL modules ]

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-05 19:22:53 -04:00
Steven Rostedt
41ede23ede ring-buffer: disable writers when resetting buffers
As a precaution, it is best to disable writing to the ring buffers
when reseting them.

[ Impact: prevent weird things if write happens during reset ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-05 17:22:02 -04:00
Steven Rostedt
afbab76a62 ring-buffer: have read page swap increment counter with page entries
In the swap page ring buffer code that is used by the ftrace splice code,
we scan the page to increment the counter of entries read.

With the number of entries already in the page we simply need to add it.

[ Impact: speed up reading page from ring buffer ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-05 16:58:24 -04:00
Linus Torvalds
99ee12973e Merge branch 'timers/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  clockevents: prevent endless loop in tick_handle_periodic()
2009-05-05 12:09:38 -07:00
Linus Torvalds
bcb1656827 Merge branch 'irq/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  Revert "genirq: assert that irq handlers are indeed running in hardirq context"
2009-05-05 12:09:27 -07:00
Linus Torvalds
e858e8b076 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: account system time properly
2009-05-05 12:08:40 -07:00
Linus Torvalds
da87bbd142 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  kernel/posix-cpu-timers.c: fix sparse warning
  dma-debug: remove broken dma memory leak detection for 2.6.30
  locking: Documentation: lockdep-design.txt, fix note of state bits
2009-05-05 12:08:20 -07:00
Linus Torvalds
e91b3b2681 Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing: x86, mmiotrace: fix range test
  tracing: fix ref count in splice pages
2009-05-05 12:08:02 -07:00
Steven Rostedt
778c55d44e ring-buffer: record page entries in buffer page descriptor
Currently, when the ring buffer writer overflows the buffer and must
write over non consumed data, we increment the overrun counter by
reading the entries on the page we are about to overwrite. This reads
the entries one by one.

This is not very effecient. This patch adds another entry counter
into each buffer page descriptor that keeps track of the number of
entries on the page. Now on overwrite, the overrun counter simply
needs to add the number of entries that is on the page it is about
to overwrite.

[ Impact: speed up of ring buffer in overwrite mode ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-05 14:28:08 -04:00
Oleg Nesterov
41c51c98f5 rcu: rcu_sched_grace_period(): kill the bogus flush_signals()
As a kernel thread, rcu_sched_grace_period() runs with all signals ignored.
It can never receive a signal even if it sleeps in TASK_INTERRUPTIBLE, it
needs the explicit allow_signal() to be visible for signals.

[ Impact: reduce kernel size, remove dead code ]

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20090503211118.GA22973@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-05 20:28:05 +02:00
Steven Rostedt
e4906eff9e ring-buffer: convert cpu buffer entries to local_t
The entries counter in cpu buffer is not atomic. It can be updated by
other interrupts or from another CPU (readers).

But making entries into "atomic_t" causes an atomic operation that can
hurt performance. Instead we convert it to a local_t that will increment
a counter with a local CPU atomic operation (if the arch supports it).

Instead of fighting with readers and overwrites that decrement the counter,
I added a "read" counter. Every time a reader reads an entry it is
incremented.

We already have a overrun counter and with that, the entries counter and
the read counter, we can calculate the total number of entries in the
buffer with:

  (entries - overrun) - read

As long as the total number of entries in the ring buffer is less than
the word size, this will work. But since the entries counter was previously
a long, this is no different than what we had before.

Thanks to Andrew Morton for pointing out in the first version that
atomic_t does not replace unsigned long. I switched to atomic_long_t
even though it is signed. A negative count is most likely a bug.

[ Impact: keep accurate count of cpu buffer entries ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-05 14:25:44 -04:00
Peter Zijlstra
2023b35921 perf_counter: inheritable sample counters
Redirect the output to the parent counter and put in some sanity checks.

[ Impact: new perfcounter feature - inherited sampling counters ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090505155437.331556171@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-05 20:18:33 +02:00
Peter Zijlstra
22c1558e51 perf_counter: fix the output lock
Use -1 instead of 0 as unlocked, since 0 is a valid cpu number.

( This is not an issue right now but will be once we allow multiple
  counters to output to the same mmap area. )

[ Impact: prepare code for multi-counter profile output ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090505155437.232686598@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-05 20:18:32 +02:00
Peter Zijlstra
c5078f78b4 perf_counter: provide an mlock threshold
Provide a threshold to relax the mlock accounting, increasing usability.

Each counter gets perf_counter_mlock_kb for free.

[ Impact: allow more mmap buffering ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090505155437.112113632@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-05 20:18:32 +02:00
Peter Zijlstra
6de6a7b957 perf_counter: add ioctl(PERF_COUNTER_IOC_RESET)
Provide a way to reset an existing counter - this eases PAPI
libraries around perfcounters.

Similar to read() it doesn't collapse pending child counters.

[ Impact: new perfcounter fd ioctl method to reset counters ]

Suggested-by: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090505155437.022272933@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-05 20:18:31 +02:00
Peter Zijlstra
c66de4a5be perf_counter: uncouple data_head updates from wakeups
Keep data_head up-to-date irrespective of notifications. This fixes
the case where you disable a counter and don't get a notification for
the last few pending events, and it also allows polling usage.

[ Impact: increase precision of perfcounter mmap-ed fields ]

Suggested-by: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090505155436.925084300@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-05 20:18:30 +02:00
Peter Zijlstra
60aa605dfc sched: rt: document the risk of small values in the bandwidth settings
Thomas noted that we should disallow sysctl_sched_rt_runtime == 0 for
(!RT_GROUP) since the root group always has some RT tasks in it.

Further, update the documentation to inspire clue.

[ Impact: exclude corner-case sysctl_sched_rt_runtime value ]

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090505155436.863098054@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-05 20:07:57 +02:00
Steven Rostedt
c8d771835e tracing: export stats of ring buffers to userspace
This patch adds stats to the ftrace ring buffers:

 # cat /debugfs/tracing/per_cpu/cpu0/stats
 entries: 42360
 overrun: 30509326
 commit overrun: 0
 nmi dropped: 0

Where entries are the total number of data entries in the buffer.

overrun is the number of entries not consumed and were overwritten by
the writer.

commit overrun is the number of entries dropped due to nested writers
wrapping the buffer before the initial writer finished the commit.

nmi dropped is the number of entries dropped due to the ring buffer
lock being held when an nmi was going to write to the ring buffer.
Note, this field will be meaningless and will go away when the ring
buffer becomes lockless.

[ Impact: let userspace know what is happening in the ring buffers ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-05 13:52:02 -04:00
Steven Rostedt
f0d2c681ac ring-buffer: add counters for commit overrun and nmi dropped entries
The WARN_ON in the ring buffer when a commit is preempted and the
buffer is filled by preceding writes can happen in normal operations.
The WARN_ON makes it look like a bug, not to mention, because
it does not stop tracing and calls printk which can also recurse, this
is prone to deadlock (the WARN_ON is not in a position to recurse).

This patch removes the WARN_ON and replaces it with a counter that
can be retrieved by a tracer. This counter is called commit_overrun.

While at it, I added a nmi_dropped counter to count any time an NMI entry
is dropped because the NMI could not take the spinlock.

[ Impact: prevent deadlock by printing normal case warning ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-05 13:51:02 -04:00
Steven Rostedt
d6ce96dabe ring-buffer: export symbols
I'm adding a module to do a series of tests on the ring buffer as well
as benchmarks. This module needs to have more of the ring buffer API
exported. There's nothing wrong with reading the ring buffer from a
module.

[ Impact: allow modules to read pages from the ring buffer ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-05-05 13:46:33 -04:00
Ingo Molnar
1dce8d99b8 perf_counter: convert perf_resource_mutex to a spinlock
Now percpu counters can be initialized very early. But the init
sequence uses mutex_lock(). Fortunately, perf_resource_mutex should
be a spinlock anyway, so convert it.

[ Impact: fix crash due to early init mutex use ]

LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-04 19:30:42 +02:00
Ingo Molnar
0d905bca23 perf_counter: initialize the per-cpu context earlier
percpu scheduling for perfcounters wants to take the context lock,
but that lock first needs to be initialized. Currently it is an
early_initcall() - but that is too late, the task tick runs much
sooner than that.

Call it explicitly from the scheduler init sequence instead.

[ Impact: fix access-before-init crash ]

LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-04 19:30:32 +02:00
Ingo Molnar
b82914ce33 perf_counter: round-robin per-CPU counters too
This used to be unstable when we had the rq->lock dependencies,
but now that they are that of the past we can turn on percpu
counter RR too.

[ Impact: handle counter over-commit for per-CPU counters too ]

LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-04 19:29:57 +02:00
Andrea Righi
9e4a5bda89 mm: prevent divide error for small values of vm_dirty_bytes
Avoid setting less than two pages for vm_dirty_bytes: this is necessary to
avoid potential division by 0 (like the following) in get_dirty_limits().

[   49.951610] divide error: 0000 [#1] PREEMPT SMP
[   49.952195] last sysfs file: /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/uevent
[   49.952195] CPU 1
[   49.952195] Modules linked in: pcspkr
[   49.952195] Pid: 3064, comm: dd Not tainted 2.6.30-rc3 #1
[   49.952195] RIP: 0010:[<ffffffff802d39a9>]  [<ffffffff802d39a9>] get_dirty_limits+0xe9/0x2c0
[   49.952195] RSP: 0018:ffff88001de03a98  EFLAGS: 00010202
[   49.952195] RAX: 00000000000000c0 RBX: ffff88001de03b80 RCX: 28f5c28f5c28f5c3
[   49.952195] RDX: 0000000000000000 RSI: 00000000000000c0 RDI: 0000000000000000
[   49.952195] RBP: ffff88001de03ae8 R08: 0000000000000000 R09: 0000000000000000
[   49.952195] R10: ffff88001ddda9a0 R11: 0000000000000001 R12: 0000000000000001
[   49.952195] R13: ffff88001fbc8218 R14: ffff88001de03b70 R15: ffff88001de03b78
[   49.952195] FS:  00007fe9a435b6f0(0000) GS:ffff8800025d9000(0000) knlGS:0000000000000000
[   49.952195] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   49.952195] CR2: 00007fe9a39ab000 CR3: 000000001de38000 CR4: 00000000000006e0
[   49.952195] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   49.952195] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   49.952195] Process dd (pid: 3064, threadinfo ffff88001de02000, task ffff88001ddda250)
[   49.952195] Stack:
[   49.952195]  ffff88001fa0de00 ffff88001f2dbd70 ffff88001f9fe800 000080b900000000
[   49.952195]  00000000000000c0 ffff8800027a6100 0000000000000400 ffff88001fbc8218
[   49.952195]  0000000000000000 0000000000000600 ffff88001de03bb8 ffffffff802d3ed7
[   49.952195] Call Trace:
[   49.952195]  [<ffffffff802d3ed7>] balance_dirty_pages_ratelimited_nr+0x1d7/0x3f0
[   49.952195]  [<ffffffff80368f8e>] ? ext3_writeback_write_end+0x9e/0x120
[   49.952195]  [<ffffffff802cc7df>] generic_file_buffered_write+0x12f/0x330
[   49.952195]  [<ffffffff802cce8d>] __generic_file_aio_write_nolock+0x26d/0x460
[   49.952195]  [<ffffffff802cda32>] ? generic_file_aio_write+0x52/0xd0
[   49.952195]  [<ffffffff802cda49>] generic_file_aio_write+0x69/0xd0
[   49.952195]  [<ffffffff80365fa6>] ext3_file_write+0x26/0xc0
[   49.952195]  [<ffffffff803034d1>] do_sync_write+0xf1/0x140
[   49.952195]  [<ffffffff80290d1a>] ? get_lock_stats+0x2a/0x60
[   49.952195]  [<ffffffff80280730>] ? autoremove_wake_function+0x0/0x40
[   49.952195]  [<ffffffff8030411b>] vfs_write+0xcb/0x190
[   49.952195]  [<ffffffff803042d0>] sys_write+0x50/0x90
[   49.952195]  [<ffffffff8022ff6b>] system_call_fastpath+0x16/0x1b
[   49.952195] Code: 00 00 00 2b 05 09 1c 17 01 48 89 c6 49 0f af f4 48 c1 ee 02 48 89 f0 48 f7 e1 48 89 d6 31 d2 48 c1 ee 02 48 0f af 75 d0 48 89 f0 <48> f7 f7 41 8b 95 ac 01 00 00 48 89 c7 49 0f af d4 48 c1 ea 02
[   49.952195] RIP  [<ffffffff802d39a9>] get_dirty_limits+0xe9/0x2c0
[   49.952195]  RSP <ffff88001de03a98>
[   50.096523] ---[ end trace 008d7aa02f244d7b ]---

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:10 -07:00
Magnus Damm
c81fc2c331 clockevent: export register_device and delta2ns
Export the following symbols using EXPORT_SYMBOL_GPL:
 - clockevent_delta2ns
 - clockevents_register_device

This allows us to build SuperH clockevent and clocksource
drivers as modules, see drivers/clocksource/sh_*.c

[ Impact: allow modular build of clockevent drivers ]

Signed-off-by: Magnus Damm <damm@igel.co.jp>
LKML-Reference: <20090501055247.8286.64067.sendpatchset@rx1.opensource.se>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-02 11:51:07 +02:00
john stultz
7d27558c41 timekeeping: create arch_gettimeoffset infrastructure
Some arches don't supply their own clocksource. This is mainly the
case in architectures that get their inter-tick times by reading the
counter on their interval timer.  Since these timers wrap every tick,
they're not really useful as clocksources.  Wrapping them to act like
one is possible but not very efficient. So we provide a callout these
arches can implement for use with the jiffies clocksource to provide
finer then tick granular time.

[ Impact: ease the migration to generic time keeping ]

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-02 11:45:15 +02:00
Magnus Damm
a25cbd045a clocksource: setup mult_orig in clocksource_enable()
Setup clocksource mult_orig in clocksource_enable().

Clocksource drivers can save power by using keeping the
device clock disabled while the clocksource is unused.

In practice this means that the enable() and disable()
callbacks perform clk_enable() and clk_disable().

The enable() callback may also use clk_get_rate() to get
the clock rate from the clock framework. This information
can then be used to calculate the shift and mult variables.

Currently the mult_orig variable is setup from mult at
registration time only. This is conflicting with the above
case since the clock is disabled and the mult variable is
not yet calculated at the time of registration.

Moving the mult_orig setup code to clocksource_enable()
allows us to both handle the common case with no enable()
callback and the mult-changed-after-enable() case.

[ Impact: allow dynamic clock source usage ]

Signed-off-by: Magnus Damm <damm@igel.co.jp>
LKML-Reference: <20090501054546.8193.10688.sendpatchset@rx1.opensource.se>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-02 11:45:15 +02:00
Jon Hunter
a041988876 timers: allow deferrable timers for intervals tv2-tv5 to be deferred
In the current kernel implementation only kernel timers for time interval
tv1 are being deferred. This patch allows any timer that is configured as
deferrable to be defer regardless of time interval.

This patch was previously discussed in
http://marc.info/?l=linux-kernel&m=123196343531966&w=2 and was acked by
Venki Pallipadi, the author of the original deferrable timer patch.

Signed-off-by: Jon Hunter <jon-hunter@ti.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-02 10:36:06 +02:00
Dmitri Vorobiev
a52f5c5620 clockevents: tick_broadcast_device can become static
The variable tick_broadcast_device is not used outside of the
file where it is defined, so let's make it static.

Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-02 10:31:14 +02:00
john stultz
74a03b69d1 clockevents: prevent endless loop in tick_handle_periodic()
tick_handle_periodic() can lock up hard when a one shot clock event
device is used in combination with jiffies clocksource.

Avoid an endless loop issue by requiring that a highres valid
clocksource be installed before we call tick_periodic() in a loop when
using ONESHOT mode. The result is we will only increment jiffies once
per interrupt until a continuous hardware clocksource is available.

Without this, we can run into a endless loop, where each cycle through
the loop, jiffies is updated which increments time by tick_period or
more (due to clock steering), which can cause the event programming to
think the next event was before the newly incremented time and fail
causing tick_periodic() to be called again and the whole process loops
forever.

[ Impact: prevent hard lock up ]

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
2009-05-02 10:22:27 +02:00
Yinghai Lu
15e957d08d x86/irq: use move_irq_desc() in create_irq_nr()
move_irq_desc() will try to move irq_desc to the home node if
the allocated one is not correct, in create_irq_nr().

( This can happen on devices that are on different nodes that
  are using MSI, when drivers are loaded and unloaded randomly. )

v2: fix non-smp build
v3: add NUMA_IRQ_DESC to eliminate #ifdefs

[ Impact: improve irq descriptor locality on NUMA systems ]

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <49F95EAE.2050903@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-01 19:01:12 +02:00
Thomas Gleixner
d7226fb6ec Revert "genirq: assert that irq handlers are indeed running in hardirq context"
This reverts commit 044d408409.

The commit added a warning when handle_IRQ_event() is called outside
of hard interrupt context. This breaks the generic tasklet based
interrupt resend mechanism which is used when the hardware has no way
to retrigger the interrupt. So we get a warning for a use case which
is correct and worked for years. Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-05-01 15:16:04 +02:00
Peter Zijlstra
c33a0bc4e4 perf_counter: fix race in perf_output_*
When two (or more) contexts output to the same buffer, it is possible
to observe half written output.

Suppose we have CPU0 doing perf_counter_mmap(), CPU1 doing
perf_counter_overflow(). If CPU1 does a wakeup and exposes head to
user-space, then CPU2 can observe the data CPU0 is still writing.

[ Impact: fix occasionally corrupted profiling records ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090501102533.007821627@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-01 13:23:43 +02:00
Oleg Nesterov
78a3d9d565 do_wait: do take security_task_wait() into account
I was never able to understand what should we actually do when
security_task_wait() fails, but the current code doesn't look right.

If ->task_wait() returns the error, we update *notask_error correctly.
But then we either reap the child (despite the fact this was forbidden)
or clear *notask_error (and hide the securiy policy problems).

This patch assumes that "stolen by ptrace" doesn't matter. If selinux
denies the child we should ignore it but make sure we report -EACCESS
instead of -ECHLD if there are no other eligible children.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-01 08:49:29 +10:00
Thomas Gleixner
3c56999eec Merge branch 'core/signal' into perfcounters/core
This is necessary to avoid the conflict of syscall numbers.

Conflicts:
	arch/x86/ia32/ia32entry.S
	arch/x86/include/asm/unistd_32.h
	arch/x86/include/asm/unistd_64.h

Fixes up the borked syscall numbers of perfcounters versus
preadv/pwritev as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-30 21:16:49 +02:00
Thomas Gleixner
62ab4505e3 signals: implement sys_rt_tgsigqueueinfo
sys_kill has the per thread counterpart sys_tgkill. sigqueueinfo is
missing a thread directed counterpart. Such an interface is important
for migrating applications from other OSes which have the per thread
delivery implemented.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Ulrich Drepper <drepper@redhat.com>
2009-04-30 19:24:24 +02:00
Thomas Gleixner
30b4ae8a44 signals: split do_tkill
Split out the code from do_tkill to make it reusable by the follow up
patch which implements sys_rt_tgsigqueueinfo

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
2009-04-30 19:24:23 +02:00
Darren Hart
ba9c22f2c0 futex: remove FUTEX_REQUEUE_PI (non CMP)
The new requeue PI futex op codes were modeled after the existing
FUTEX_REQUEUE and FUTEX_CMP_REQUEUE calls.  I was unaware at the time
that FUTEX_REQUEUE was only around for compatibility reasons and
shouldn't be used in new code.  Ulrich Drepper elaborates on this in his
Futexes are Tricky paper: http://people.redhat.com/drepper/futex.pdf.
The deprecated call doesn't catch changes to the futex corresponding to
the destination futex which can lead to deadlock.

Therefor, I feel it best to remove FUTEX_REQUEUE_PI and leave only
FUTEX_CMP_REQUEUE_PI as there are not yet any existing users of the API.
This patch does change the OP code value of FUTEX_CMP_REQUEUE_PI to 12
from 13.  Since my test case is the only known user of this API, I felt
this was the right thing to do, rather than leave a hole in the
enumeration.

I chose to continue using the _CMP_ modifier in the OP code to make it
explicit to the user that the test is being done.

Builds, boots, and ran several hundred iterations requeue_pi.c.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
LKML-Reference: <49ED580E.1050502@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-30 11:41:35 +02:00
Andrew Morton
a511e3f968 mutex: add atomic_dec_and_mutex_lock(), fix
include/linux/mutex.h:136: warning: 'mutex_lock' declared inline after being called
 include/linux/mutex.h:136: warning: previous declaration of 'mutex_lock' was here

uninline it.

[ Impact: clean up and uninline, address compiler warning ]

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Paris <eparis@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <200904292318.n3TNIsi6028340@imap1.linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-30 09:01:34 +02:00
Paul Mackerras
c5dd016cdf perf_counter: update copyright notice
This adds my name to the list of copyright holders on the core
perf_counter.c, since I have contributed a significant amount of the
code in there.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
LKML-Reference: <18936.59200.888049.746658@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-30 08:23:11 +02:00
H Hartley Sweeten
6e85c5ba73 kernel/posix-cpu-timers.c: fix sparse warning
Sparse reports the following in kernel/posix-cpu-timers.c:

  warning: symbol 'firing' shadows an earlier one

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Subrata Modak <subrata@linux.vnet.ibm.com>
LKML-Reference: <BD79186B4FD85F4B8E60E381CAEE1909016C1AFE@mi8nycmail19.Mi8.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-30 08:08:31 +02:00
David Howells
3bcac0263f SELinux: Don't flush inherited SIGKILL during execve()
Don't flush inherited SIGKILL during execve() in SELinux's post cred commit
hook.  This isn't really a security problem: if the SIGKILL came before the
credentials were changed, then we were right to receive it at the time, and
should honour it; if it came after the creds were changed, then we definitely
should honour it; and in any case, all that will happen is that the process
will be scrapped before it ever returns to userspace.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-04-30 09:07:13 +10:00
Luis Henriques
23b94b967f locking, rtmutex.c: Documentation cleanup
Two minor updates on functions documentation:
 - Updated documentation for function rt_mutex_unlock(), which contained an
   incorrect name
 - Removed extra '*' from comment in function rt_mutex_destroy()

[ Impact: cleanup ]

Signed-off-by: Luis Henriques <henrix@sapo.pt>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20090429205451.GA23154@hades.domain.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 23:20:17 +02:00
Eric Dumazet
f5f293a4e3 sched: account system time properly
Andrew Gallatin reported that IRQ and SOFTIRQ times were
sometime not reported correctly on recent kernels, and even
bisected to commit 457533a7d3
([PATCH] fix scaled & unscaled cputime accounting) as the first
bad commit.

Further analysis pointed that commit
79741dd357 ([PATCH] idle cputime
accounting) was the real cause of the problem.

account_process_tick() was not taking into account timer IRQ
interrupting the idle task servicing a hard or soft irq.

On mostly idle cpu, irqs were thus not accounted and top or
mpstat could tell user/admin that cpu was 100 % idle, 0.00 %
irq, 0.00 % softirq, while it was not.

[ Impact: fix occasionally incorrect CPU statistics in top/mpstat ]

Reported-by: Andrew Gallatin <gallatin@myri.com>
Re-reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: rick.jones2@hp.com
Cc: brice@myri.com
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LKML-Reference: <49F84BC1.7080602@cosmosbay.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 15:02:28 +02:00
Ingo Molnar
9814451142 perf_counter: add/update copyrights
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:52:50 +02:00
Robert Richter
4aeb0b4239 perfcounters: rename struct hw_perf_counter_ops into struct pmu
This patch renames struct hw_perf_counter_ops into struct pmu. It
introduces a structure to describe a cpu specific pmu (performance
monitoring unit). It may contain ops and data. The new name of the
structure fits better, is shorter, and thus better to handle. Where it
was appropriate, names of function and variable have been changed too.

[ Impact: cleanup ]

Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1241002046-8832-7-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:51:03 +02:00
Ingo Molnar
e7fd5d4b3d Merge branch 'linus' into perfcounters/core
Merge reason: This brach was on -rc1, refresh it to almost-rc4 to pick up
              the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:47:05 +02:00
David Howells
50fa610a3b sched: Document memory barriers implied by sleep/wake-up primitives
Add a section to the memory barriers document to note the implied
memory barriers of sleep primitives (set_current_state() and wrappers)
and wake-up primitives (wake_up() and co.).

Also extend the in-code comments on the wake_up() functions to note
these implied barriers.

[ Impact: add documentation ]

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20090428140138.1192.94723.stgit@warthog.procyon.org.uk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:15:55 +02:00
Heiko Carstens
a0e39ed378 tracing: fix build failure on s390
"tracing: create automated trace defines" causes this compile error on s390,
as reported by Sachin Sant against linux-next:

 kernel/built-in.o: In function `__do_softirq':
 (.text+0x1c680): undefined reference to `__tracepoint_softirq_entry'

This happens because the definitions of the softirq tracepoints were moved
from kernel/softirq.c to kernel/irq/handle.c. Since s390 doesn't support
generic hardirqs handle.c doesn't get compiled and the definitions are
missing.

So move the tracepoints to softirq.c again.

[ Impact: fix build failure on s390 ]

Reported-by: Sachin Sant <sachinp@in.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: fweisbec@gmail.com
LKML-Reference: <20090429135139.5fac79b8@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:06:21 +02:00
Tom Zanussi
8b37256210 tracing/filters: a better event parser
Replace the current event parser hack with a better one.  Filters are
no longer specified predicate by predicate, but all at once and can
use parens and any of the following operators:

numeric fields:

==, !=, <, <=, >, >=

string fields:

==, !=

predicates can be combined with the logical operators:

&&, ||

examples:

"common_preempt_count > 4" > filter

"((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter

If there was an error, the erroneous string along with an error
message can be seen by looking at the filter e.g.:

((sig >= 10 && sig < 15) || dsig == 17) && comm != bash
^
parse_error: Field not found

Currently the caret for an error always appears at the beginning of
the filter; a real position should be used, but the error message
should be useful even without it.

To clear a filter, '0' can be written to the filter file.

Filters can also be set or cleared for a complete subsystem by writing
the same filter as would be written to an individual event to the
filter file at the root of the subsytem.  Note however, that if any
event in the subsystem lacks a field specified in the filter being
set, the set will fail and all filters in the subsytem are
automatically cleared.  This change from the previous version was made
because using only the fields that happen to exist for a given event
would most likely result in a meaningless filter.

Because the logical operators are now implemented as predicates, the
maximum number of predicates in a filter was increased from 8 to 16.

[ Impact: add new, extended trace-filter implementation ]

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: fweisbec@gmail.com
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1240905899.6416.121.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:06:11 +02:00
Tom Zanussi
a118e4d140 tracing/filters: distinguish between signed and unsigned fields
The new filter comparison ops need to be able to distinguish between
signed and unsigned field types, so add an is_signed flag/param to the
event field struct/trace_define_fields().  Also define a simple macro,
is_signed_type() to determine the signedness at compile time, used in the
trace macros.  If the is_signed_type() macro won't work with a specific
type, a new slightly modified version of TRACE_FIELD() called
TRACE_FIELD_SIGN(), allows the signedness to be set explicitly.

[ Impact: extend trace-filter code for new feature ]

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: fweisbec@gmail.com
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1240905893.6416.120.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:06:03 +02:00
Tom Zanussi
30e673b230 tracing/filters: move preds into event_filter object
Create a new event_filter object, and move the pred-related members
out of the call and subsystem objects and into the filter object - the
details of the filter implementation don't need to be exposed in the
call and subsystem in any case, and it will also help make the new
parser implementation a little cleaner.

[ Impact: refactor trace-filter code to prepare for new features ]

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: fweisbec@gmail.com
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1240905887.6416.119.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:05:54 +02:00
Steven Rostedt
7267fa6819 tracing: fix ref count in splice pages
The pages allocated for the splice binary buffer did not initialize
the ref count correctly. This caused pages not to be freed and causes
a drastic memory leak.

Thanks to logdev I was able to trace the tracer to find where the leak
was.

[ Impact: stop memory leak when using splice ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 08:02:44 +02:00
Steven Rostedt
7d7d2b8031 ring-buffer: fix printk output
The warning output in trace_recursive_lock uses %d for a long when
it should be %ld.

[ Impact: fix compile warning ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-29 00:42:01 -04:00
Steven Rostedt
f2957f1f19 tracing: have splice only copy full pages
Splice works with pages, it is much more effecient to use an entire
page than to copy bits over several pages.

Using logdev to trace the internals of the splice mechanism, I was
able to see that splice can be very aggressive. When tracing is
occurring, and the reader caught up to the writer, and the writer
is on the reader page, the reader will copy what is there into the
splice page. Splice may iterate over several pages and if the
writer is still writing to the page, the reader will keep copying
bits to new pages to pass to userspace.

This patch changes it to only pass data to userspace if the page
is full (the writer has left the page). This has a small side effect
that splice can not read a partial page, and must wait for the
page to fill. This should not be an issue. If tracing has stopped,
then a use of "read" will still read all of the page.

[ Impact: better performance for ring buffer splice code ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-29 00:26:30 -04:00
Steven Rostedt
93459c6cb9 tracing: only add splice page if entries exist
The splice code allocates a page even when the ring buffer is empty.
It detects the ring buffer being empty when it it fails to copy
anything from the ring buffer into the page.

This patch adds a check to see if there is anything in the ring buffer
before allocating a page.

Thanks to logdev for letting me trace the tracer to find this.

[ Impact: speed up due to removing unnecessary allocation ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-29 00:23:13 -04:00
Steven Rostedt
5beae6efd1 tracing: fix ref count in splice pages
The pages allocated for the splice binary buffer did not initialize
the ref count correctly. This caused pages not to be freed and causes
a drastic memory leak.

Thanks to logdev I was able to trace the tracer to find where the leak
was.

[ Impact: stop memory leak when using splice ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-29 00:16:21 -04:00
Steven Rostedt
cd891ae030 tracing: convert ftrace_dump spinlocks to raw
ftrace_dump is used for printing out the contents of the ftrace ring buffer
to the console on failure. Currently it uses a spinlock to synchronize
the output from multiple failures on different CPUs. This spin lock
currently is a normal spinlock and can cause issues with lockdep and
lock tracing.

This patch converts it to raw since it is for error handling only.
The lock is local to the ftrace_dump and is not used by any other
infrastructure.

[ Impact: prevent ftrace_dump from locking up by internal tracing ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-28 11:39:34 -04:00
Yinghai Lu
85ac16d033 x86/irq: change irq_desc_alloc() to take node instead of cpu
This simplifies the node awareness of the code. All our allocators
only deal with a NUMA node ID locality not with CPU ids anyway - so
there's no need to maintain (and transform) a CPU id all across the
IRq layer.

v2: keep move_irq_desc related

[ Impact: cleanup, prepare IRQ code to be NUMA-aware ]

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
LKML-Reference: <49F65536.2020300@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-28 12:21:17 +02:00
Yinghai Lu
57b150cce8 irq: only update affinity if ->set_affinity() is sucessfull
irq_set_affinity() and move_masked_irq() try to assign affinity
before calling chip set_affinity(). Some archs are assigning it
in ->set_affinity() again.

We do something like:

 cpumask_cpy(desc->affinity, mask);
 desc->chip->set_affinity(mask);

But in the failure path, affinity should not be touched - otherwise
we'll end up with a different affinity mask despite the failure to
migrate the IRQ.

So try to update the afffinity only if set_affinity returns with 0.
Also call irq_set_thread_affinity accordingly.

v2: update after "irq, x86: Remove IRQ_DISABLED check in process context IRQ move"
v3: according to Ingo, change set_affinity() in irq_chip should return int.
v4: update comments by removing moving irq_desc code.

[ Impact: fix /proc/irq/*/smp_affinity setting corner case bug ]

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <49F65509.60307@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-28 12:21:16 +02:00
Yinghai Lu
fcef5911c7 x86/irq: remove leftover code from NUMA_MIGRATE_IRQ_DESC
The original feature of migrating irq_desc dynamic was too fragile
and was causing problems: it caused crashes on systems with lots of
cards with MSI-X when user-space irq-balancer was enabled.

We now have new patches that create irq_desc according to device
numa node. This patch removes the leftover bits of the dynamic balancer.

[ Impact: remove dead code ]

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <49F654AF.8000808@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-28 12:21:15 +02:00
Yinghai Lu
9ec4fa271f irq, cpumask: correct CPUMASKS_OFFSTACK typo and fix fallout
CPUMASKS_OFFSTACK is not defined anywhere (it is CPUMASK_OFFSTACK).
It is a typo and init_allocate_desc_masks() is called before it set
affinity to all cpus...

Split init_alloc_desc_masks() into all_desc_masks() and init_desc_masks().

Also use CPUMASK_OFFSTACK in alloc_desc_masks().

[ Impact: fix smp_affinity copying/setup when moving irq_desc between CPUs ]

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
LKML-Reference: <49F6546E.3040406@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-28 12:21:14 +02:00
Linus Torvalds
51b3960e78 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
  ptrace: ptrace_attach: fix the usage of ->cred_exec_mutex
2009-04-27 08:38:51 -07:00
Oleg Nesterov
cad81bc252 ptrace: ptrace_attach: fix the usage of ->cred_exec_mutex
ptrace_attach() needs task->cred_exec_mutex, not current->cred_exec_mutex.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-04-27 20:30:51 +10:00
Linus Torvalds
fc2e3180a7 Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86/irq: mark NUMA_MIGRATE_IRQ_DESC broken
  x86, irq: Remove IRQ_DISABLED check in process context IRQ move
2009-04-26 10:29:18 -07:00
Linus Torvalds
1e4b978154 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  locking: clarify kernel-taint warning message
  lockdep, x86: account for irqs enabled in paranoid_exit
  lockdep: more robust lockdep_map init sequence
2009-04-26 10:29:01 -07:00
Steven Rostedt
701970b3a8 tracing/events: make modules have their own file_operations structure
For proper module reference counting, the file_operations that modules use
must have the "owner" field set to the module. Unfortunately, the trace events
use share file_operations. The same file_operations are used by all both
kernel core and all modules.

This patch makes the modules allocate their own file_operations and
copies the functions from the core kernel. This allows those file
operations to be owned by the module.

Care is taken to free this code on module unload.

Thanks to Greg KH for reminding me that file_operations must be owned
by the module to have reference counting take place.

[ Impact: fix modular tracepoints / potential crash ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-26 13:07:00 +02:00
Steven Rostedt
060fa5c83e tracing/events: reuse trace event ids after overflow
With modules being able to add trace events, and the max trace event
counter is 16 bits (65536) we can overflow the counter easily
with a simple while loop adding and removing modules that contain
trace events.

This patch links together the registered trace events and on overflow
searches for available trace event ids. It will still fail if
over 65536 events are registered, but considering that a typical
kernel only has 22000 functions, 65000 events should be sufficient.

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-24 23:06:00 -04:00
Rafael J. Wysocki
0c8454f566 PM/Hibernate: Fix waiting for image device to appear on resume
Commit c751085943 ("PM/Hibernate: Wait for
SCSI devices scan to complete during resume") added a call to
scsi_complete_async_scans() to software_resume(), so that it waited for
the SCSI scanning to complete, but the call was added at a wrong place.

Namely, it should have been added after wait_for_device_probe(), which
is called only if the image partition hasn't been specified yet.  Also,
it's reasonable to check if the image partition is present and only wait
for the device probing and SCSI scanning to complete if it is not the
case.

Additionally, since noresume is checked right at the beginning of
software_resume() and the function returns immediately if it's set, it
doesn't make sense to check it once again later.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-24 15:31:30 -07:00
Jonathan Corbet
418df63c2d Delete slow-work timers properly
Slow-work appears to delete its timer as soon as the first user
unregisters, even though other users could be active.  At the same time, it
never seems to delete slow_work_oom_timer.  Arrange for both to happen in
the shutdown path.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-24 07:47:59 -07:00
Ingo Molnar
416dfdcdb8 Merge commit 'v2.6.30-rc3' into tracing/hw-branch-tracing
Conflicts:
	arch/x86/kernel/ptrace.c

Merge reason: fix the conflict above, and also pick up the CONFIG_BROKEN
              dependency change from upstream so that we can remove it
	      here.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-24 10:11:23 +02:00
Lai Jiangshan
334d4169a6 ring_buffer: compressed event header
RB_MAX_SMALL_DATA = 28bytes is too small for most tracers, it wastes
an 'u32' to save the actually length for events which data size > 28.

This fix uses compressed event header and enlarges RB_MAX_SMALL_DATA.

[ Impact: saves about 0%-12.5%(depends on tracer) memory in ring_buffer ]

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <49F13189.3090000@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-24 00:08:38 -04:00
Steven Rostedt
75db37d2f4 tracing: add size checks for exported ftrace internal structures
The events exported by TRACE_EVENT are automated and are guaranteed
to be correct when used.

The internal ftrace structures on the other hand are more manually
exported. These require the ftrace maintainer to make sure they
are up to date.

This patch adds a size check to help flag when a type changes in
an internal ftrace data structure, and the update needs to be reflected
in the export.

If a export is incorrect, then the only harm is that the user space
tools will not know how to correctly read the internal structures of
ftrace.

[ Impact: help prevent inconsistent ftrace format print outs ]

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-04-23 23:03:56 -04:00
Steven Rostedt
89ec0dee9e tracing: increase size of number of possible events
With the new event tracing registration, we must increase the number
of events that can be registered. Currently the type field is only
one byte, which leaves us only 256 possible events.

Since we do not save the CPU number in the tracer anymore (it is determined
by the per cpu ring buffer that is used) we have an extra byte to use.

This patch increases the size of type from 1 byte (256 events) to
2 bytes (65,536 events).

It also adds a WARN_ON_ONCE if we exceed that limit.

[ Impact: allow more than 255 events ]

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-04-23 23:03:19 -04:00
Steven Rostedt
9be24414aa tracing/wakeup: move access to wakeup_cpu into spinlock
The code had the following outside the lock:

        if (next != wakeup_task)
                return;

        pc = preempt_count();

        /* The task we are waiting for is waking up */
        data = wakeup_trace->data[wakeup_cpu];

On initialization, wakeup_task is NULL and wakeup_cpu -1. This code
is not under a lock. If wakeup_task is set on another CPU as that
task is waking up, we can see the wakeup_task before wakeup_cpu is
set. If we read wakeup_cpu while it is still -1 then we will have
a bad data pointer.

This patch moves the reading of wakeup_cpu within the protection of
the spinlock used to protect the writing of wakeup_cpu and wakeup_task.

[ Impact: remove possible race causing invalid pointer dereference ]

Reported-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-04-23 23:01:36 -04:00
Ingo Molnar
b48ccb095a locking: clarify kernel-taint warning message
Andi Kleen reported this message triggering on non-lockdep kernels:

   Disabling lockdep due to kernel taint

Clarify the message to say 'lock debugging' - debug_locks_off()
turns off all things lock debugging, not just lockdep.

[ Impact: change kernel warning message text ]

Reported-by: Andi Kleen <andi@firstfloor.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-23 09:36:52 +02:00
Li Zefan
7a4f453b6d tracing/events: make struct trace_entry->type to be int type
struct trace_entry->type is unsigned char, while trace event's id is
int type, thus for a event with id >= 256, it's entry->type is cast
to (id % 256), and then we can't see the trace output of this event.

 # insmod trace-events-sample.ko
 # echo foo_bar > /mnt/tracing/set_event
 # cat /debug/tracing/events/trace-events-sample/foo_bar/id
 256
 # cat /mnt/tracing/trace_pipe
           <...>-3548  [001]   215.091142: Unknown type 0
           <...>-3548  [001]   216.089207: Unknown type 0
           <...>-3548  [001]   217.087271: Unknown type 0
           <...>-3548  [001]   218.085332: Unknown type 0

[ Impact: fix output for trace events with id >= 256 ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <49EEDB0E.5070207@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-22 11:36:38 +02:00
Magnus Damm
4614e6adaf clocksource: add enable() and disable() callbacks
Add enable() and disable() callbacks for clocksources.

This allows us to put unused clocksources in power save mode.  The
functions clocksource_enable() and clocksource_disable() wrap the
callbacks and are inserted in the timekeeping code to enable before use
and disable after switching to a new clocksource.

Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-21 13:41:47 -07:00
Magnus Damm
8e19608e8b clocksource: pass clocksource to read() callback
Pass clocksource pointer to the read() callback for clocksources.  This
allows us to share the callback between multiple instances.

[hugh@veritas.com: fix powerpc build of clocksource pass clocksource mods]
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-21 13:41:47 -07:00
Steven Rostedt
3554228d42 ring-buffer: only warn on wrap if buffer is bigger than two pages
On boot up, to save memory, ftrace allocates the minimum buffer
which is two pages. Ftrace also goes through a series of tests
(when configured) on boot up. These tests can fill up a page within
a single interrupt.

The ring buffer also has a WARN_ON when it detects that the buffer was
completely filled within a single commit (other commits are allowed to
be nested).

Combine the small buffer on start up, with the tests that can fill more
than a single page within an interrupt, this can trigger the WARN_ON.

This patch makes the WARN_ON only happen when the ring buffer consists
of more than two pages.

[ Impact: prevent false WARN_ON in ftrace startup tests ]

Reported-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <20090421094616.GA14561@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-21 16:00:45 +02:00
Li Zefan
f66578a763 tracing/filters: allow user-input to be integer-like string
Suppose we would like to trace all tasks named '123', but this
will fail:

 # echo 'parent_comm == 123' > events/sched/sched_process_fork/filter
 bash: echo: write error: Invalid argument

Don't guess the type of the filter pred in filter_parse(), but instead
we check it in __filter_add_pred().

[ Impact: extend allowed filter field string values ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <49ED8DEB.6000700@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-21 11:58:28 +02:00
Li Zefan
e8082f3f5a tracing/filters: don't remove old filters when failed to write subsys->filter
If writing subsys->filter returns EINVAL or ENOSPC, the original
filters in subsys/ and subsys/events/ will be removed. This is
definitely wrong.

[ Impact: fix filter setting semantics on error condition ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <49ED8DD2.2070700@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-21 11:58:27 +02:00
Gautham R Shenoy
6e29ec5701 sched: Replace first_cpu() with cpumask_first() in ILB nomination code
Stephen Rothwell reported this build warning:

>  kernel/sched.c: In function 'find_new_ilb':
>  kernel/sched.c:4355: warning: passing argument 1 of '__first_cpu' from incompatible pointer type
>
> Possibly caused by commit f711f6090a
> ("sched: Nominate idle load balancer from a semi-idle package") from
> the sched tree.  Should this call to first_cpu be cpumask_first?

For !(CONFIG_SCHED_MC || CONFIG_SCHED_SMT), find_new_ilb() nominates the
Idle load balancer as the first cpu from the nohz.cpu_mask.

This code uses the older API first_cpu(). Replace it with cpumask_first(),
which is the correct API here.

[ Impact: cleanup, address build warning ]

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <20090421031049.GA4140@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-21 08:06:05 +02:00
Al Viro
24b6f16ecf No need for crossing to mountpoint in audit_tag_tree()
is_under() will DTRT anyway.  And yes, is_subdir() behaviour
is intentional.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20 23:01:15 -04:00
Steven Rostedt
cb4764a6db tracing: use nowakeup version of commit for function event trace tests
The startup tests for the event tracer also runs with the function
tracer enabled. The "wakeup" version of the trace commit was used
which can grab spinlocks. If a task was preempted by an NMI
that called a function being traced, it could deadlock due to the
function tracer trying to grab the same lock.

Thanks to Frederic Weisbecker for pointing out where the bug was.

Reported-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-20 18:16:44 -04:00
Steven Rostedt
aa18efb2a2 tracing: use recursive counter over irq level
Althought using the irq level (hardirq_count, softirq_count and in_nmi)
was nice to detect bad recursion right away, but since the counters are
not atomically updated with respect to the interrupts, the function tracer
might trigger the test from an interrupt handler before the hardirq_count
is updated. This will trigger a false warning.

This patch converts the recursive detection to a simple counter.
If the depth is greater than 16 then the recursive detection will trigger.
16 is more than enough for any nested interrupts.

[ Impact: fix false positive trace recursion detection ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-20 16:16:11 -04:00
Peter Zijlstra
ff743345bf sched: remove extra call overhead for schedule()
Lai Jiangshan's patch reminded me that I promised Nick to remove
that extra call overhead in schedule().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090313112300.927414207@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-20 20:49:53 +02:00
Ingo Molnar
f1f9b3b179 perfcounters, sched: remove __task_delta_exec()
This function was left orphan by the latest round of sw-counter
cleanups.

[ Impact: remove unused kernel function ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-20 20:38:21 +02:00
Steven Rostedt
e395898e98 tracing: remove recursive test from ring_buffer_event_discard
The ring_buffer_event_discard is not tied to ring_buffer_lock_reserve.
It can be called inside or outside the reserve/commit. Even if it
is called inside the reserve/commit the commit part must also be called.

Only ring_buffer_discard_commit can be used as a replacement for
ring_buffer_unlock_commit.

This patch removes the trace_recursive_unlock from ring_buffer_event_discard
since it would be the wrong place to do so.

[Impact: prevent breakage in trace recursive testing ]

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-20 13:32:44 -04:00
Steven Rostedt
17487bfeb6 tracing: fix recursive test level calculation
The recursive tests to detect same level recursion in the ring buffers
did not account for the hard/softirq_counts to be shifted. Thus the
numbers could be larger than then mask to be tested.

This patch includes the shift for the calculation of the irq depth.

[ Impact: stop false positives in trace recursion detection ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-20 13:24:21 -04:00
Steven Rostedt
28d20e2d6e tracing/events: call the correct event trace selftest init function
The late_initcall calls a helper function instead of the proper
init event selftest function.

This update may have been lost due to conflicting merges.

[ Impact: fix compiler warning and call extended event trace self tests ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-20 12:12:44 -04:00
Steven Rostedt
a7abe97fd8 tracing: rename EVENT_TRACER config to ENABLE_EVENT_TRACING
Currently we have two configs: EVENT_TRACING and EVENT_TRACER.
All tracers enable EVENT_TRACING. The EVENT_TRACER is only a
convenience to enable the EVENT_TRACING when no other tracers
are enabled.

The names EVENT_TRACER and EVENT_TRACING are too similar and confusing.
This patch renames EVENT_TRACER to ENABLE_EVENT_TRACING to be more
appropriate to what it actually does, as well as add a comment in
the help menu to explain the option's purpose.

[ Impact: rename config option to reduce confusion ]

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-20 17:55:35 +02:00
Steven Rostedt
4ed9f0716e tracing: create menuconfig for tracing infrastructure
During testing we often use randconfig to test various kernels.
The current configuration set up does not give an easy way to disable
all tracing with a single config. The case where randconfig would
test all tracing disabled is very unlikely.

This patch adds a config option to enable or disable all tracing.
It is hooked into the tracing menu just like other submenus are done.

[ Impact: allow randconfig to easily produce all traces disabled ]

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-20 11:07:49 -04:00
Steven Rostedt
9ae5b87900 tracing: change branch profiling to a choice selection
This patch makes the branch profiling into a choice selection:

  None               - no branch profiling
  likely/unlikely    - only profile likely/unlikely branches
  all                - profile all branches

The all profiler will also enable the likely/unlikely branches.

This does not change the way the profiler works or the dependencies
between the profilers.

What this patch does, is keep the branch profiling from being selected
by an allyesconfig make. The branch profiler is very intrusive and
it is known to break various architecture builds when selected as an
allyesconfig.

[ Impact: prevent branch profiler from being selected in allyesconfig ]

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-20 11:07:48 -04:00
Frederic Weisbecker
f3b9aae162 tracing/ring-buffer: Add unlock recursion protection on discard
The pair of helpers trace_recursive_lock() and trace_recursive_unlock()
have been introduced recently to provide generic tracing recursion
protection.

They are used in a symetric way:

 - trace_recursive_lock() on buffer reserve
 - trace_recursive_unlock() on buffer commit

However sometimes, we don't commit but discard on entry
to the buffer, ie: in case of filter checking.

Then we must also unlock the recursion protection on discard time,
otherwise the tracing gets definitely deactivated and a warning
is raised spuriously, such as:

111.119821] ------------[ cut here ]------------
[  111.119829] WARNING: at kernel/trace/ring_buffer.c:1498 ring_buffer_lock_reserve+0x1b7/0x1d0()
[  111.119835] Hardware name: AMILO Li 2727
[  111.119839] Modules linked in:
[  111.119846] Pid: 5731, comm: Xorg Tainted: G        W  2.6.30-rc1 #69
[  111.119851] Call Trace:
[  111.119863]  [<ffffffff8025ce68>] warn_slowpath+0xd8/0x130
[  111.119873]  [<ffffffff8028a30f>] ? __lock_acquire+0x19f/0x1ae0
[  111.119882]  [<ffffffff8028a30f>] ? __lock_acquire+0x19f/0x1ae0
[  111.119891]  [<ffffffff802199b0>] ? native_sched_clock+0x20/0x70
[  111.119899]  [<ffffffff80286dee>] ? put_lock_stats+0xe/0x30
[  111.119906]  [<ffffffff80286eb8>] ? lock_release_holdtime+0xa8/0x150
[  111.119913]  [<ffffffff802c8ae7>] ring_buffer_lock_reserve+0x1b7/0x1d0
[  111.119921]  [<ffffffff802cd110>] trace_buffer_lock_reserve+0x30/0x70
[  111.119930]  [<ffffffff802ce000>] trace_current_buffer_lock_reserve+0x20/0x30
[  111.119939]  [<ffffffff802474e8>] ftrace_raw_event_sched_switch+0x58/0x100
[  111.119948]  [<ffffffff808103b7>] __schedule+0x3a7/0x4cd
[  111.119957]  [<ffffffff80211b56>] ? ftrace_call+0x5/0x2b
[  111.119964]  [<ffffffff80211b56>] ? ftrace_call+0x5/0x2b
[  111.119971]  [<ffffffff80810c08>] schedule+0x18/0x40
[  111.119977]  [<ffffffff80810e09>] preempt_schedule+0x39/0x60
[  111.119985]  [<ffffffff80813bd3>] _read_unlock+0x53/0x60
[  111.119993]  [<ffffffff807259d2>] sock_def_readable+0x72/0x80
[  111.120002]  [<ffffffff807ad5ed>] unix_stream_sendmsg+0x24d/0x3d0
[  111.120011]  [<ffffffff807219a3>] sock_aio_write+0x143/0x160
[  111.120019]  [<ffffffff80211b56>] ? ftrace_call+0x5/0x2b
[  111.120026]  [<ffffffff80721860>] ? sock_aio_write+0x0/0x160
[  111.120033]  [<ffffffff80721860>] ? sock_aio_write+0x0/0x160
[  111.120042]  [<ffffffff8031c283>] do_sync_readv_writev+0xf3/0x140
[  111.120049]  [<ffffffff80211b56>] ? ftrace_call+0x5/0x2b
[  111.120057]  [<ffffffff80276ff0>] ? autoremove_wake_function+0x0/0x40
[  111.120067]  [<ffffffff8045d489>] ? cap_file_permission+0x9/0x10
[  111.120074]  [<ffffffff8045c1e6>] ? security_file_permission+0x16/0x20
[  111.120082]  [<ffffffff8031cab4>] do_readv_writev+0xd4/0x1f0
[  111.120089]  [<ffffffff80211b56>] ? ftrace_call+0x5/0x2b
[  111.120097]  [<ffffffff80211b56>] ? ftrace_call+0x5/0x2b
[  111.120105]  [<ffffffff8031cc18>] vfs_writev+0x48/0x70
[  111.120111]  [<ffffffff8031cd65>] sys_writev+0x55/0xc0
[  111.120119]  [<ffffffff80211e32>] system_call_fastpath+0x16/0x1b
[  111.120125] ---[ end trace 15605f4e98d5ccb5 ]---

[ Impact: fix spurious warning triggering tracing shutdown ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-04-20 10:59:20 +02:00
Frederic Weisbecker
e057a5e564 tracing/core: Add current context on tracing recursion warning
In case of tracing recursion detection, we only get the stacktrace.
But the current context may be very useful to debug the issue.

This patch adds the softirq/hardirq/nmi context with the warning
using lockdep context display to have a familiar output.

v2: Use printk_once()
v3: drop {hardirq,softirq}_context which depend on lockdep,
    only keep what is part of current->trace_recursion,
    sufficient to debug the warning source.

[ Impact: print context necessary to debug recursion ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-04-19 23:38:12 +02:00
Rafael J. Wysocki
6a7c7eaf71 PM/Suspend: Introduce two new platform callbacks to avoid breakage
Commit 900af0d973 (PM: Change suspend
code ordering) changed the ordering of suspend code in such a way
that the platform .prepare() callback is now executed after the
device drivers' late suspend callbacks have run.  Unfortunately, this
turns out to break ARM platforms that need to talk via I2C to power
control devices during the .prepare() callback.

For this reason introduce two new platform suspend callbacks,
.prepare_late() and .wake(), that will be called just prior to
disabling non-boot CPUs and right after bringing them back on line,
respectively, and use them instead of .prepare() and .finish() for
ACPI suspend.  Make the PM core execute the .prepare() and .finish()
platform suspend callbacks where they were executed previously (that
is, right after calling the regular suspend methods provided by
device drivers and right before executing their regular resume
methods, respectively).

It is not necessary to make analogous changes to the hibernation
code and data structures at the moment, because they are only used
by ACPI platforms.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Len Brown <len.brown@intel.com>
2009-04-19 20:08:42 +02:00
Linus Torvalds
ff54250a0e Remove 'recurse into child resources' logic from 'reserve_region_with_split()'
This function is not actually used right now, since the original use
case for it was done with insert_resource_expand_to_fit() instead.

However, we now have another usage case that wants to basically do a
"reserve IO resource, splitting around existing resources", however that
one doesn't actually want the "recurse into the conflicting resource"
logic at all.

And since recursing into the conflicting resource was the most complex
part, and isn't wanted, just remove it.  Maybe we'll some day want both
versions, but we can just resurrect the logic then.

Tested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-18 21:44:24 -07:00
Steven Rostedt
3189cdb316 tracing: protect trace_printk from recursion
trace_printk can be called from any context, including NMIs.
If this happens, then we must test for for recursion before
grabbing any spinlocks.

This patch prevents trace_printk from being called recursively.

[ Impact: prevent hard lockup in lockdep event tracer ]

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-17 16:21:35 -04:00
Steven Rostedt
261842b7c9 tracing: add same level recursion detection
The tracing infrastructure allows for recursion. That is, an interrupt
may interrupt the act of tracing an event, and that interrupt may very well
perform its own trace. This is a recursive trace, and is fine to do.

The problem arises when there is a bug, and the utility doing the trace
calls something that recurses back into the tracer. This recursion is not
caused by an external event like an interrupt, but by code that is not
expected to recurse. The result could be a lockup.

This patch adds a bitmask to the task structure that keeps track
of the trace recursion. To find the interrupt depth, the following
algorithm is used:

  level = hardirq_count() + softirq_count() + in_nmi;

Here, level will be the depth of interrutps and softirqs, and even handles
the nmi. Then the corresponding bit is set in the recursion bitmask.
If the bit was already set, we know we had a recursion at the same level
and we warn about it and fail the writing to the buffer.

After the data has been committed to the buffer, we clear the bit.
No atomics are needed. The only races are with interrupts and they reset
the bitmask before returning anywy.

[ Impact: detect same irq level trace recursion ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-17 16:21:32 -04:00
Steven Rostedt
12acd473d4 tracing: add EXPORT_SYMBOL_GPL for trace commits
Not all the necessary symbols were exported to allow for tracing
by modules. This patch adds them in.

[ Impact: allow modules to commit data to the ring buffer ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-17 16:21:28 -04:00
Tom Zanussi
ac1adc55fc tracing/filters: add filter_mutex to protect filter predicates
This patch adds a filter_mutex to prevent the filter predicates from
being accessed concurrently by various external functions.

It's based on a previous patch by Li Zefan:
        "[PATCH 7/7] tracing/filters: make filter preds RCU safe"

v2 changes:

- fixed wrong value returned in a add_subsystem_pred() failure case
  noticed by Li Zefan.

[ Impact: fix trace filter corruption/crashes on parallel access ]

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: paulmck@linux.vnet.ibm.com
LKML-Reference: <1239946028.6639.13.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-17 18:28:27 +02:00
Li Zefan
339ae5d3c3 tracing: fix file mode of trace and README
trace is read-write and README is read-only.

[ Impact: fix /debug/tracing/ file permissions. ]

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49E7EAB6.4070605@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-17 18:04:36 +02:00
Peter Zijlstra
c8a2500586 lockdep: more robust lockdep_map init sequence
Steven Rostedt reported:

> OK, I think I figured this bug out. This is a lockdep issue with respect
> to tracepoints.
>
> The trace points in lockdep are called all the time. Outside the lockdep
> logic. But if lockdep were to trigger an error / warning (which this run
> did) we might be in trouble. For new locks, like the dentry->d_lock, that
> are created, they will not get a name:
>
> void lockdep_init_map(struct lockdep_map *lock, const char *name,
>                       struct lock_class_key *key, int subclass)
> {
>         if (unlikely(!debug_locks))
>                 return;
>
> When a problem is found by lockdep, debug_locks becomes false. Thus we
> stop allocating names for locks. This dentry->d_lock I had, now has no
> name. Worse yet, I have CONFIG_DEBUG_VM set, that scrambles non
> initialized memory. Thus, when the trace point was hit, it had junk for
> the lock->name, and the machine crashed.

Ah, nice catch. I think we should put at least the name in regardless.

Ensure we at least initialize the trivial entries of the depmap so that
they can be relied upon, even when lockdep itself decided to pack up and
go home.

[ Impact: fix lock tracing after lockdep warnings. ]

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1239954049.23397.4156.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-17 18:00:00 +02:00
Steven Rostedt
9ea21c1ecd tracing/events: perform function tracing in event selftests
We can find some bugs in the trace events if we stress the writes as well.
The function tracer is a good way to stress the events.

[ Impact: extend scope of event tracer self-tests ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20090416161746.604786131@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-17 17:10:35 +02:00
Avadh Patel
69abe6a5d1 tracing: add saved_cmdlines file to show cached task comms
Export the cached task comms to userspace. This allows user apps to translate
the pids from a trace into their respective task command lines.

[ Impact: let userspace apps reading binary buffer know comm's of pids ]

Signed-off-by: Avadh Patel <avadh4all@gmail.com>
[ added error checking and use of buf pointer to index file_buf ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-17 17:04:12 +02:00
Steven Rostedt
d1b182a8d4 tracing/events/ring-buffer: expose format of ring buffer headers to users
Currently, every thing needed to read the binary output from the
ring buffers is available, with the exception of the way the ring
buffers handles itself internally.

This patch creates two special files in the debugfs/tracing/events
directory:

 # cat /debug/tracing/events/header_page
        field: u64 timestamp;   offset:0;       size:8;
        field: local_t commit;  offset:8;       size:8;
        field: char data;       offset:16;      size:4080;

 # cat /debug/tracing/events/header_event
        type        :    2 bits
        len         :    3 bits
        time_delta  :   27 bits
        array       :   32 bits

        padding     : type == 0
        time_extend : type == 1
        data        : type == 3

This is to allow a userspace app to see if the ring buffer format changes
or not.

[ Impact: allow userspace apps to know of ringbuffer format changes ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-17 17:03:28 +02:00
Steven Rostedt
e6187007d6 tracing/events: add startup tests for events
As events start to become popular, and the new way to add tracing
infrastructure into ftrace, it is important to catch any problems
that might happen with a mistake in the TRACE_EVENT macro.

This patch introduces a startup self test on the registered trace
events. Note, it can only do a generic test, any type of testing that
needs more involement is needed to be implemented by the tracepoint
creators.

The test goes down one by one enabling a trace point and running
some random tasks (random in the sense that I just made them up).
Those tasks are creating threads, grabbing mutexes and spinlocks
and using workqueues.

After testing each event individually, it does the same test after
enabling each system of trace points. Like sched, irq, lockdep.

Then finally it enables all tracepoints and performs the tasks again.
The output to the console on bootup will look like this when everything
works:

Running tests on trace events:
Testing event kfree_skb: OK
Testing event kmalloc: OK
Testing event kmem_cache_alloc: OK
Testing event kmalloc_node: OK
Testing event kmem_cache_alloc_node: OK
Testing event kfree: OK
Testing event kmem_cache_free: OK
Testing event irq_handler_exit: OK
Testing event irq_handler_entry: OK
Testing event softirq_entry: OK
Testing event softirq_exit: OK
Testing event lock_acquire: OK
Testing event lock_release: OK
Testing event sched_kthread_stop: OK
Testing event sched_kthread_stop_ret: OK
Testing event sched_wait_task: OK
Testing event sched_wakeup: OK
Testing event sched_wakeup_new: OK
Testing event sched_switch: OK
Testing event sched_migrate_task: OK
Testing event sched_process_free: OK
Testing event sched_process_exit: OK
Testing event sched_process_wait: OK
Testing event sched_process_fork: OK
Testing event sched_signal_send: OK
Running tests on trace event systems:
Testing event system skb: OK
Testing event system kmem: OK
Testing event system irq: OK
Testing event system lockdep: OK
Testing event system sched: OK
Running tests on all trace events:
Testing all events: OK

[ folded in:

  tracing: add #include <linux/delay.h> to fix build failure in test_work()

  This build failure occured on a few rare configs:

   kernel/trace/trace_events.c: In function ‘test_work’:
   kernel/trace/trace_events.c:975: error: implicit declaration of function ‘udelay’
   kernel/trace/trace_events.c:980: error: implicit declaration of function ‘msleep’

  delay.h is included in way too many other headers, hiding cases
  where new usage is added without header inclusion.

  [ Impact: build fix ]

  Signed-off-by: Ingo Molnar <mingo@elte.hu>
]

[ Impact: add event tracer self-tests ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-17 17:01:37 +02:00
Steven Rostedt
93eb677d74 ftrace: use module notifier for function tracer
The hooks in the module code for the function tracer must be called
before any of that module code runs. The function tracer hooks
modify the module (replacing calls to mcount to nops). If the code
is executed while the change occurs, then the CPU can take a GPF.

To handle the above with a bit of paranoia, I originally implemented
the hooks as calls directly from the module code.

After examining the notifier calls, it looks as though the start up
notify is called before any of the module's code is executed. This makes
the use of the notify safe with ftrace.

Only the startup notify is required to be "safe". The shutdown simply
removes the entries from the ftrace function list, and does not modify
any code.

This change has another benefit. It removes a issue with a reverse dependency
in the mutexes of ftrace_lock and module_mutex.

[ Impact: fix lock dependency bug, cleanup ]

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-17 16:59:15 +02:00
Linus Torvalds
9f76208c33 Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing: Fix branch tracer header
  tracing: Fix power tracer header
2009-04-16 18:17:22 -07:00
Linus Torvalds
d06be22150 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Avoid printing sched_group::__cpu_power for default case
  tracing, sched: mark get_parent_ip() notrace
2009-04-16 18:16:29 -07:00
Linus Torvalds
4d831f53dd Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  kernel/softirq.c: fix sparse warning
  rcu: Make hierarchical RCU less IPI-happy
2009-04-16 17:56:39 -07:00
H Hartley Sweeten
79d381c9f2 kernel/softirq.c: fix sparse warning
Fix sparse warning in kernel/softirq.c.

  warning: do-while statement is not a compound statement

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
LKML-Reference: <BD79186B4FD85F4B8E60E381CAEE1909015F9033@mi8nycmail19.Mi8.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-17 01:57:54 +02:00
Gautham R Shenoy
381512cf3d sched: Avoid printing sched_group::__cpu_power for default case
Commit 46e0bb9c12 ("sched: Print sched_group::__cpu_power
in sched_domain_debug") produces a messy dmesg output while
attempting to print the sched_group::__cpu_power for each
group in the sched_domain hierarchy.

Fix this by avoid printing the __cpu_power for default cases.
(i.e, __cpu_power == SCHED_LOAD_SCALE).

[ Impact: reduce syslog clutter ]

Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Fixed-by: Tony Luck <tony.luck@intel.com>
Cc: a.p.zijlstra@chello.nl
LKML-Reference: <20090414033936.GA534@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-17 00:46:05 +02:00
Li Zefan
f3948f8857 blktrace: fix context-info when mixed-using blk tracer and trace events
When current tracer is set to blk tracer, TRACE_ITER_CONTEXT_INFO is
unset, but actually context-info is printed:

    pdflush-431   [000]   821.181576:   8,0    P   N [pdflush]

And then if we enable TRACE_ITER_CONTEXT_INFO:

    # echo context-info > trace_options

We'll see context-info printed twice. What's worse, when we use blk
tracer and trace events at the same time, we'll see no context-info
for trace events at all:

    jbd2_commit_logging: dev dm-0:8 transaction 333227
    jbd2_end_commit: dev dm-0:8 transaction 333227 head 332814
      rm-25433 [001]  9578.307485:   8,18   m   N cfq25433 slice expired t=0
      rm-25433 [001]  9578.307486:   8,18   m   N cfq25433 put_queue

This patch adds blk_tracer->set_flags(), and context-info flag is unset
only when we set the output to classic mode.

Note after this patch, one should unset context-info explicitly if he
wants to get binary output that can be parsed by blkparse:

    # echo nocontext-info > trace_options
    # echo bin > trace_options
    # echo blk > current_tracer
    # cat trace_pipe | blkparse -i -

Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49E54E60.50408@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-16 10:11:01 +02:00
Li Zefan
1d54ad6da9 blktrace: add trace/ to /sys/block/sda
Impact: allow ftrace-plugin blktrace to trace device-mapper devices

To trace a single partition:
  # echo 1 > /sys/block/sda/sda1/enable

To trace the whole sda instead:
  # echo 1 > /sys/block/sda/enable

Thus we also fix an issue reported by Ted, that ftrace-plugin blktrace
can't be used to trace device-mapper devices.

Now:

  # echo 1 > /sys/block/dm-0/trace/enable
  echo: write error: No such device or address
  # mount -t ext4 /dev/dm-0 /mnt
  # echo 1 > /sys/block/dm-0/trace/enable
  # echo blk > /debug/tracing/current_tracer

Reported-by: Theodore Tso <tytso@mit.edu>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Shawn Du <duyuyang@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
LKML-Reference: <49E42665.6020506@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-16 10:10:59 +02:00
Li Zefan
9908c30997 blktrace: support per-partition tracing for ftrace plugin
The previous patch adds support to trace a single partition for
relay+ioctl blktrace, and this patch is for ftrace plugin blktrace:

  # echo 1 > /sys/block/sda/sda7/enable
  # cat start_lba
  102398373
  # cat end_lba
  102703545

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Shawn Du <duyuyang@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
LKML-Reference: <49E42646.4060608@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-16 10:10:58 +02:00
Shawn Du
d0deef5b14 blktrace: support per-partition tracing
Though one can specify '-d /dev/sda1' when using blktrace, it still
traces the whole sda.

To support per-partition tracing, when we start tracing, we initialize
bt->start_lba and bt->end_lba to the start and end sector of that
partition.

Note some actions are per device, thus we don't filter 0-sector events.

The original patch and discussion can be found here:
	http://marc.info/?l=linux-btrace&m=122949374214540&w=2

Signed-off-by: Shawn Du <duyuyang@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
LKML-Reference: <49E42620.4050701@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-16 10:10:57 +02:00
Steven Whitehouse
ff7b1b4f00 perfcounters: export perf_tpcounter_event
Needed for modular tracepoint support.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-16 01:10:04 +02:00
David Howells
5b1d07ed0e RCU: Don't try and predeclare inline funcs as it upsets some versions of gcc
Don't try and predeclare inline funcs like this:

	static inline void wait_migrated_callbacks(void)
	...
	static void _rcu_barrier(enum rcu_barrier type)
	{
		...
		wait_migrated_callbacks();
	}
	...
	static inline void wait_migrated_callbacks(void)
	{
		wait_event(rcu_migrate_wq, !atomic_read(&rcu_migrate_type_count));
	}

as it upsets some versions of gcc under some circumstances:

	kernel/rcupdate.c: In function `_rcu_barrier':
	kernel/rcupdate.c:125: sorry, unimplemented: inlining failed in call to 'wait_migrated_callbacks': function body not available
	kernel/rcupdate.c:152: sorry, unimplemented: called from here

This can be dealt with by simply putting the static variables (rcu_migrate_*)
at the top, and moving the implementation of the function up so that it
replaces its forward declaration.

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-15 13:55:14 -07:00
Miao Xie
13318a7186 sched: use group_first_cpu() instead of cpumask_first(sched_group_cpus())
Impact: cleanup

This patch changes cpumask_first(sched_group_cpus()) to group_first_cpu()
for maintainability.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-15 13:34:07 +02:00
Nikanth Karthikesan
297dbf50d7 swap: Remove code handling bio_alloc failure with __GFP_WAIT
Remove code handling bio_alloc failure with __GFP_WAIT.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:13 +02:00
Steven Rostedt
ad8d75fff8 tracing/events: move trace point headers into include/trace/events
Impact: clean up

Create a sub directory in include/trace called events to keep the
trace point headers in their own separate directory. Only headers that
declare trace points should be defined in this directory.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 22:05:43 -04:00
Steven Rostedt
61f919a12f tracing/events: fix compile for modules disabled
Impact: compile fix

The addition of TRACE_EVENT for modules breaks the build for when
modules are disabled. This code fixes that.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 22:04:19 -04:00
Stephen Rothwell
19e4529ee7 modules: Fix up build when CONFIG_MODULE_UNLOAD=n.
Commit 3d43321b70 ("modules: sysctl to
block module loading") introduces a modules_disabled variable that is
only defined if CONFIG_MODULE_UNLOAD is enabled, despite being used in
other places. This moves it up and fixes up the build.

  CC      kernel/module.o
kernel/module.c: In function 'sys_init_module':
kernel/module.c:2401: error: 'modules_disabled' undeclared (first use in this function)
kernel/module.c:2401: error: (Each undeclared identifier is reported only once
kernel/module.c:2401: error: for each function it appears in.)
make[1]: *** [kernel/module.o] Error 1
make: *** [kernel/module.o] Error 2

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: James Morris <jmorris@namei.org>
2009-04-15 08:17:31 +10:00
Steven Rostedt
6d723736e4 tracing/events: add support for modules to TRACE_EVENT
Impact: allow modules to add TRACE_EVENTS on load

This patch adds the final hooks to allow modules to use the TRACE_EVENT
macro. A notifier and a data structure are used to link the TRACE_EVENTs
defined in the module to connect them with the ftrace event tracing system.

It also adds the necessary automated clean ups to the trace events when a
module is removed.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 12:58:03 -04:00
Steven Rostedt
17c873ec28 tracing/events: add export symbols for trace events in modules
Impact: let modules add trace events

The trace event code requires some functions to be exported to allow
modules to use TRACE_EVENT. This patch adds EXPORT_SYMBOL_GPL to the
necessary functions.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 12:58:01 -04:00
Steven Rostedt
a59fd60272 tracing/events: convert event call sites to use a link list
Impact: makes it possible to define events in modules

The events are created by reading down the section that they are linked
in by the macros. But this is not scalable to modules. This patch converts
the manipulations to use a global link list, and on boot up it adds
the items in the section to the list.

This change will allow modules to add their tracing events to the list as
well.

Note, this change alone does not permit modules to use the TRACE_EVENT macros,
but the change is needed for them to eventually do so.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 12:58:00 -04:00
Steven Rostedt
f42c85e74f tracing/events: move the ftrace event tracing code to core
This patch moves the ftrace creation into include/trace/ftrace.h and
simplifies the work of developers in adding new tracepoints.
Just the act of creating the trace points in include/trace and including
define_trace.h will create the events in the debugfs/tracing/events
directory.

This patch removes the need of include/trace/trace_events.h

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 12:57:59 -04:00
Steven Rostedt
97f2025153 tracing/events: move declarations from trace directory to core include
In preparation to allowing trace events to happen in modules, we need
to move some of the local declarations in the kernel/trace directory
into include/linux.

This patch simply moves the declarations and performs no context changes.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 12:57:58 -04:00
Steven Rostedt
9504504cba tracing: make trace_seq operations available for core kernel
In the process to make TRACE_EVENT macro work for modules, the trace_seq
operations must be available for core kernel code.

These operations are quite useful and can be used for other implementations.

The main idea is that we create a trace_seq handle that acts very much
like the seq_file handle.

	struct trace_seq *s = kmalloc(sizeof(*s, GFP_KERNEL);

	trace_seq_init(s);
	trace_seq_printf(s, "some data %d\n", variable);

	printk("%s", s->buffer);

The main use is to allow a top level function call several other functions
that may store printf like data into the buffer. Then at the end, the top
level function can process all the data with any method it would like to.
It could be passed to userspace, output via printk or even use seq_file:

	trace_seq_to_user(s, ubuf, cnt);
	seq_puts(m, s->buffer);

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 12:57:57 -04:00
Steven Rostedt
a8d154b009 tracing: create automated trace defines
This patch lowers the number of places a developer must modify to add
new tracepoints. The current method to add a new tracepoint
into an existing system is to write the trace point macro in the
trace header with one of the macros TRACE_EVENT, TRACE_FORMAT or
DECLARE_TRACE, then they must add the same named item into the C file
with the macro DEFINE_TRACE(name) and then add the trace point.

This change cuts out the needing to add the DEFINE_TRACE(name).
Every file that uses the tracepoint must still include the trace/<type>.h
file, but the one C file must also add a define before the including
of that file.

 #define CREATE_TRACE_POINTS
 #include <trace/mytrace.h>

This will cause the trace/mytrace.h file to also produce the C code
necessary to implement the trace point.

Note, if more than one trace/<type>.h is used to create the C code
it is best to list them all together.

 #define CREATE_TRACE_POINTS
 #include <trace/foo.h>
 #include <trace/bar.h>
 #include <trace/fido.h>

Thanks to Mathieu Desnoyers and Christoph Hellwig for coming up with
the cleaner solution of the define above the includes over my first
design to have the C code include a "special" header.

This patch converts sched, irq and lockdep and skb to use this new
method.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 12:57:28 -04:00
Ingo Molnar
56449f437a tracing: make the trace clocks available generally
Jeremy Fitzhardinge reported this build failure:

 LD	 .tmp_vmlinux1
 arch/x86/kernel/built-in.o: In function `ds_take_timestamp':
 git/linux/arch/x86/kernel/ds.c:1380: undefined reference to `trace_clock_global'
 git/linux/arch/x86/kernel/ds.c:1380: undefined reference to `trace_clock_global'

Which is due to !CONFIG_TRACING && CONFIG_X86_DS=y.

Expose the trace clock code to CONFIG_X86_DS as well.

[ Unfortunately librarizing doesnt work well - ancient architectures
  with no raw_local_irq_save() primitive break the build. ]

Reported-by: Jeremy Fitzhardinge <jeremy@goop.org>
LKML-Reference: <49E4413F.7070700@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 18:35:13 +02:00
Johannes Weiner
78ddb08feb wait: don't use __wake_up_common()
'777c6c5 wait: prevent exclusive waiter starvation' made
__wake_up_common() global to be used from abort_exclusive_wait().

It was needed to do a wake-up with the waitqueue lock held while
passing down a key to the wake-up function.

Since '4ede816 epoll keyed wakeups: add __wake_up_locked_key() and
__wake_up_sync_key()' there is an appropriate wrapper for this case:
__wake_up_locked_key().

Use it here and make __wake_up_common() private to the scheduler
again.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1239720785-19661-1-git-send-email-hannes@cmpxchg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 17:17:16 +02:00
Steven Rostedt
ea20d9293c tracing: consolidate trace and trace_event headers
Impact: clean up

Neil Horman (et. al.) criticized the way the trace events were broken up
into two files. The reason for that was that ftrace needed to separate out
the declarations from where the #include <linux/tracepoint.h> was used.
It then dawned on me that the tracepoint.h header only needs to define the
TRACE_EVENT macro if it is not already defined.

The solution is simply to test if TRACE_EVENT is defined, and if it is not
then the linux/tracepoint.h header can define it. This change consolidates
all the <traces>.h and <traces>_event_types.h into the <traces>.h file.

Reported-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Theodore Tso <tytso@mit.edu>
Reported-by: Jiaying Zhang <jiayingz@google.com>
Cc: Zhaolei <zhaolei@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 09:43:40 -04:00
Pallipadi, Venkatesh
6ec3cfeca0 x86, irq: Remove IRQ_DISABLED check in process context IRQ move
As discussed in the thread here:

  http://marc.info/?l=linux-kernel&m=123964468521142&w=2

Eric W. Biederman observed:

> It looks like some additional bugs have slipped in since last I looked.
>
> set_irq_affinity does this:
> ifdef CONFIG_GENERIC_PENDING_IRQ
>        if (desc->status & IRQ_MOVE_PCNTXT || desc->status & IRQ_DISABLED) {
>                cpumask_copy(desc->affinity, cpumask);
>                desc->chip->set_affinity(irq, cpumask);
>        } else {
>                desc->status |= IRQ_MOVE_PENDING;
>                cpumask_copy(desc->pending_mask, cpumask);
>        }
> #else
>
> That IRQ_DISABLED case is a software state and as such it has nothing to
> do with how safe it is to move an irq in process context.

[...]

>
> The only reason we migrate MSIs in interrupt context today is that there
> wasn't infrastructure for support migration both in interrupt context
> and outside of it.

Yes. The idea here was to force the MSI migration to happen in process
context. One of the patches in the series did

        disable_irq(dev->irq);
        irq_set_affinity(dev->irq, cpumask_of(dev->cpu));
        enable_irq(dev->irq);

with the above patch adding irq/manage code check for interrupt disabled
and moving the interrupt in process context.

IIRC, there was no IRQ_MOVE_PCNTXT when we were developing this HPET
code and we ended up having this ugly hack. IRQ_MOVE_PCNTXT was there
when we eventually submitted the patch upstream. But, looks like I did a
blind rebasing instead of using IRQ_MOVE_PCNTXT in hpet MSI code.

Below patch fixes this. i.e., revert commit 932775a4ab
and add PCNTXT to HPET MSI setup. Also removes copying of desc->affinity
in generic code as set_affinity routines are doing it internally.

Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Li Shaohua" <shaohua.li@intel.com>
Cc: Gary Hade <garyhade@us.ibm.com>
Cc: "lcm@us.ibm.com" <lcm@us.ibm.com>
Cc: suresh.b.siddha@intel.com
LKML-Reference: <20090413222058.GB8211@linux-os.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 15:21:13 +02:00
Gautham R Shenoy
e790fb0ba6 sched: Nominate a power-efficient ilb in select_nohz_balancer()
The CPU that first goes idle becomes the idle-load-balancer and remains
that until either it picks up a task or till all the CPUs of the system
goes idle.

Optimize this further to allow it to relinquish it's post
once all it's siblings in the power-aware sched_domain go idle, thereby
allowing the whole package-core to go idle. While relinquising the post,
nominate another an idle-load balancer from a semi-idle core/package.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090414045535.7645.31641.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 11:49:19 +02:00
Gautham R Shenoy
f711f6090a sched: Nominate idle load balancer from a semi-idle package.
Currently the nomination of idle-load balancer is done by choosing the first
idle cpu in the nohz.cpu_mask. This may not be power-efficient, since
such an idle cpu could come from a completely idle core/package thereby
preventing the whole core/package from being in a low-power state.

For eg, consider a quad-core dual package system. The cpu numbering need
not be sequential and can something like [0, 2, 4, 6] and [1, 3, 5, 7].
With sched_mc/smt_power_savings and the power-aware IRQ balance, we try to keep
as fewer Packages/Cores active. But the current idle load balancer logic
goes against this by choosing the first_cpu in the nohz.cpu_mask and not
taking the system topology into consideration.

Improve the algorithm to nominate the idle load balancer from a semi idle
cores/packages thereby increasing the probability of the cores/packages being
in deeper sleep states for longer duration.

The algorithm is activated only when sched_mc/smt_power_savings != 0.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090414045530.7645.12175.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 11:49:19 +02:00
Paul E. McKenney
7ba5c840e6 rcu: Add __rcu_pending tracing to hierarchical RCU
Add tracing to __rcu_pending() to provide information on why RCU
processing was kicked off.  This is helpful for debugging hierarchical
RCU, and might also be helpful in learning how hierarchical RCU operates.

Located-by: Anton Blanchard <anton@au1.ibm.com>
Tested-by: Anton Blanchard <anton@au1.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: anton@samba.org
Cc: akpm@linux-foundation.org
Cc: dipankar@in.ibm.com
Cc: manfred@colorfullife.com
Cc: cl@linux-foundation.org
Cc: josht@linux.vnet.ibm.com
Cc: schamp@sgi.com
Cc: niv@us.ibm.com
Cc: dvhltc@us.ibm.com
Cc: ego@in.ibm.com
Cc: laijs@cn.fujitsu.com
Cc: rostedt@goodmis.org
Cc: peterz@infradead.org
Cc: penberg@cs.helsinki.fi
Cc: andi@firstfloor.org
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
LKML-Reference: <1239683479943-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 11:33:43 +02:00
Paul E. McKenney
ef631b0ca0 rcu: Make hierarchical RCU less IPI-happy
This patch fixes a hierarchical-RCU performance bug located by Anton
Blanchard.  The problem stems from a misguided attempt to provide a
work-around for jiffies-counter failure.  This work-around uses a per-CPU
n_rcu_pending counter, which is incremented on each call to rcu_pending(),
which in turn is called from each scheduling-clock interrupt.  Each CPU
then treats this counter as a surrogate for the jiffies counter, so
that if the jiffies counter fails to advance, the per-CPU n_rcu_pending
counter will cause RCU to invoke force_quiescent_state(), which in turn
will (among other things) send resched IPIs to CPUs that have thus far
failed to pass through an RCU quiescent state.

Unfortunately, each CPU resets only its own counter after sending a
batch of IPIs.  This means that the other CPUs will also (needlessly)
send -another- round of IPIs, for a full N-squared set of IPIs in the
worst case every three scheduler-clock ticks until the grace period
finally ends.  It is not reasonable for a given CPU to reset each and
every n_rcu_pending for all the other CPUs, so this patch instead simply
disables the jiffies-counter "training wheels", thus eliminating the
excessive IPIs.

Note that the jiffies-counter IPIs do not have this problem due to
the fact that the jiffies counter is global, so that the CPU sending
the IPIs can easily reset things, thus preventing the other CPUs from
sending redundant IPIs.

Note also that the n_rcu_pending counter remains, as it will continue to
be used for tracing.  It may also see use to update the jiffies counter,
should an appropriate kick-the-jiffies-counter API appear.

Located-by: Anton Blanchard <anton@au1.ibm.com>
Tested-by: Anton Blanchard <anton@au1.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: anton@samba.org
Cc: akpm@linux-foundation.org
Cc: dipankar@in.ibm.com
Cc: manfred@colorfullife.com
Cc: cl@linux-foundation.org
Cc: josht@linux.vnet.ibm.com
Cc: schamp@sgi.com
Cc: niv@us.ibm.com
Cc: dvhltc@us.ibm.com
Cc: ego@in.ibm.com
Cc: laijs@cn.fujitsu.com
Cc: rostedt@goodmis.org
Cc: peterz@infradead.org
Cc: penberg@cs.helsinki.fi
Cc: andi@firstfloor.org
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
LKML-Reference: <12396834793575-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 11:31:50 +02:00
Zhaolei
557055bebe tracing: Fix branch tracer header
Before patch:

  # tracer: branch
  #
  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
  #              | |       |          |         |
             <...>-2981  [000] 24008.872738: [  ok  ] trace_irq_handler_exit:irq_event_types.h:41
             <...>-2981  [000] 24008.872742: [  ok  ] note_interrupt:spurious.c:229
  ...

After patch:

  # tracer: branch
  #
  #           TASK-PID    CPU#    TIMESTAMP  CORRECT  FUNC:FILE:LINE
  #              | |       |          |         |       |
             <...>-2985  [000] 26329.142970: [  ok  ] slab_free:slub.c:1776
             <...>-2985  [000] 26329.142972: [  ok  ] trace_kmem_cache_free:kmem_event_types.h:191
  ...

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <49E2F19A.3040006@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 02:30:36 +02:00
Lai Jiangshan
132380a06b tracing, sched: mark get_parent_ip() notrace
Impact: remove overly redundant tracing entries

When tracer is "function" or "function_graph", way too much
"get_parent_ip" entries are recorded in ring_buffer.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <49D458B1.5000703@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 02:11:05 +02:00
Andi Kleen
3d26dcf767 kernel/sys.c: clean up sys_shutdown exit path
Impact: cleanup, fix

Clean up sys_shutdown() exit path.  Factor out common code.  Return
correct error code instead of always 0 on failure.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-13 15:04:32 -07:00
Oleg Nesterov
f1671f6d78 ptrace: fix exit_ptrace() vs ptrace_traceme() race
Pointed out by Roland.  The bug was recently introduced by me in
"forget_original_parent: split out the un-ptrace part", commit
39c626ae47.

Since that patch we have a window after exit_ptrace() drops tasklist and
before forget_original_parent() takes it again.  In this window the child
can do ptrace(PTRACE_TRACEME) and nobody can untrace this child after
that.

Change ptrace_traceme() to not attach to the exiting ->real_parent.  We
don't report the error in this case, we pretend we attach right before
->real_parent calls exit_ptrace() which should untrace us anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-13 15:04:31 -07:00
Peter Zijlstra
4be6f6bb66 mm: move the scan_unevictable_pages sysctl to the vm table
vm knobs should go in the vm table.  Probably too late for
randomize_va_space though.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-13 15:04:28 -07:00
Tom Zanussi
0a19e53c15 tracing/filters: allow on-the-fly filter switching
This patch allows event filters to be safely removed or switched
on-the-fly while avoiding the use of rcu or the suspension of tracing of
previous versions.

It does it by adding a new filter_pred_none() predicate function which
does nothing and by never deallocating either the predicates or any of
the filter_pred members used in matching; the predicate lists are
allocated and initialized during ftrace_event_calls initialization.

Whenever a filter is removed or replaced, the filter_pred_* functions
currently in use by the affected ftrace_event_call are immediately
switched over to to the filter_pred_none() function, while the rest of
the filter_pred members are left intact, allowing any currently
executing filter_pred_* functions to finish up, using the values they're
currently using.

In the case of filter replacement, the new predicate values are copied
into the old predicates after the above step, and the filter_pred_none()
functions are replaced by the filter_pred_* functions for the new
filter.  In this case, it is possible though very unlikely that a
previous filter_pred_* is still running even after the
filter_pred_none() switch and the switch to the new filter_pred_*.  In
that case, however, because nothing has been deallocated in the
filter_pred, the worst that can happen is that the old filter_pred_*
function sees the new values and as a result produces either a false
positive or a false negative, depending on the values it finds.

So one downside to this method is that rarely, it can produce a bad
match during the filter switch, but it should be possible to live with
that, IMHO.

The other downside is that at least in this patch the predicate lists
are always pre-allocated, taking up memory from the start.  They could
probably be allocated on first-use, and de-allocated when tracing is
completely stopped - if this patch makes sense, I could create another
one to do that later on.

Oh, and it also places a restriction on the size of __arrays in events,
currently set to 128, since they can't be larger than the now embedded
str_val arrays in the filter_pred struct.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: paulmck@linux.vnet.ibm.com
LKML-Reference: <1239610670.6660.49.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 00:03:55 +02:00
Ingo Molnar
b5c851a88a Merge branch 'linus' into tracing/core
Merge reason: merge latest tracing fixes to avoid conflicts in
              kernel/trace/trace_events_filter.c with upcoming change

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 00:02:22 +02:00
Tom Zanussi
eb02ce017d tracing/filters: use ring_buffer_discard_commit() in filter_check_discard()
This patch changes filter_check_discard() to make use of the new
ring_buffer_discard_commit() function and modifies the current users to
call the old commit function in the non-discard case.

It also introduces a version of filter_check_discard() that uses the
global trace buffer (filter_current_check_discard()) for those cases.

v2 changes:

- fix compile error noticed by Ingo Molnar

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: fweisbec@gmail.com
LKML-Reference: <1239178554.10295.36.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 00:00:56 +02:00
Tom Zanussi
5f77a88b3f tracing/infrastructure: separate event tracer from event support
Add a new config option, CONFIG_EVENT_TRACING that gets selected
when CONFIG_TRACING is selected and adds everything needed by the stuff
in trace_export - basically all the event tracing support needed by e.g.
bprint, minus the actual events, which are only included if
CONFIG_EVENT_TRACER is selected.

So CONFIG_EVENT_TRACER can be used to turn on or off the generated events
(what I think of as the 'event tracer'), while CONFIG_EVENT_TRACING turns
on or off the base event tracing support used by both the event tracer and
the other things such as bprint that can't be configured out.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: fweisbec@gmail.com
LKML-Reference: <1239178441.10295.34.camel@tropicana>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 00:00:55 +02:00
Steven Rostedt
77d9f465d4 tracing/filters: use ring_buffer_discard_commit for discarded events
The ring_buffer_discard_commit makes better usage of the ring_buffer
when an event has been discarded. It tries to remove it completely if
possible.

This patch converts the trace event filtering to use
ring_buffer_discard_commit instead of the ring_buffer_event_discard.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 00:00:54 +02:00
Steven Rostedt
fa1b47dd85 ring-buffer: add ring_buffer_discard_commit
The ring_buffer_discard_commit is similar to ring_buffer_event_discard
but it can only be done on an event that has yet to be commited.
Unpredictable results can happen otherwise.

The main difference between ring_buffer_discard_commit and
ring_buffer_event_discard is that ring_buffer_discard_commit will try
to free the data in the ring buffer if nothing has addded data
after the reserved event. If something did, then it acts almost the
same as ring_buffer_event_discard followed by a
ring_buffer_unlock_commit.

Note, either ring_buffer_commit_discard and ring_buffer_unlock_commit
can be called on an event, not both.

This commit also exports both discard functions to be usable by
GPL modules.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 00:00:53 +02:00
Tom Zanussi
e45f2e2bd2 tracing/filters: add TRACE_EVENT_FORMAT_NOFILTER event macro
Frederic Weisbecker suggested that the trace_special event shouldn't be
filterable; this patch adds a TRACE_EVENT_FORMAT_NOFILTER event macro
that allows an event format to be exported without having a filter
attached, and removes filtering from the trace_special event.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 00:00:51 +02:00
Tom Zanussi
e1112b4d96 tracing/filters: add run-time field descriptions to TRACE_EVENT_FORMAT events
This patch adds run-time field descriptions to all the event formats
exported using TRACE_EVENT_FORMAT.  It also hooks up all the tracers
that use them (i.e. the tracers in the 'ftrace subsystem') so they can
also have their output filtered by the event-filtering mechanism.

When I was testing this, there were a couple of things that fooled me
into thinking the filters weren't working, when actually they were -
I'll mention them here so others don't make the same mistakes (and file
bug reports. ;-)

One is that some of the tracers trace multiple events e.g. the
sched_switch tracer uses the context_switch and wakeup events, and if
you don't set filters on all of the traced events, the unfiltered output
from the events without filters on them can make it look like the
filtering as a whole isn't working properly, when actually it is doing
what it was asked to do - it just wasn't asked to do the right thing.

The other is that for the really high-volume tracers e.g. the function
tracer, the volume of filtered events can be so high that it pushes the
unfiltered events out of the ring buffer before they can be read so e.g.
cat'ing the trace file repeatedly shows either no output, or once in
awhile some output but that isn't there the next time you read the
trace, which isn't what you normally expect when reading the trace file.
If you read from the trace_pipe file though, you can catch them before
they disappear.

Changes from v1:

As suggested by Frederic Weisbecker:

- get rid of externs in functions
- added unlikely() to filter_check_discard()

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-14 00:00:50 +02:00
Zhaolei
a3d03ecaf9 tracing: Fix power tracer header
Before patch:
  # tracer: power
  #
  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
  #              | |       |          |         |
  [  676.875865889] CSTATE: Going to C1 on cpu 0 for 0.005911463
  [  676.882938805] CSTATE: Going to C1 on cpu 0 for 0.104796532
  ...

After patch:
  # tracer: power
  #
  #   TIMESTAMP      STATE  EVENT
  #       |            |      |
  [  676.875865889] CSTATE: Going to C1 on cpu 0 for 0.005911463
  [  676.882938805] CSTATE: Going to C1 on cpu 0 for 0.104796532
  ...

v2: Use seq_puts instead of seq_printf

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <49E2E889.5000903@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-13 23:39:57 +02:00
Rafael J. Wysocki
c751085943 PM/Hibernate: Wait for SCSI devices scan to complete during resume
There is a race between resume from hibernation and the asynchronous
scanning of SCSI devices and to prevent it from happening we need to
call scsi_complete_async_scans() during resume from hibernation.

In addition, if the resume from hibernation is userland-driven, it's
better to wait for all device probes in the kernel to complete before
attempting to open the resume device.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-13 11:37:07 -07:00
Linus Torvalds
8255309b88 Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing/filters: return proper error code when writing filter file
  tracing/filters: allow user input integer to be oct or hex
  tracing/filters: fix NULL pointer dereference
  tracing/filters: NIL-terminate user input filter
  ftrace: Output REC->var instead of __entry->var for trace format
  Make __stringify support variable argument macros too
  tracing: fix document references
  tracing: fix splice return too large
  tracing: update file->f_pos when splice(2) it
  tracing: allocate page when needed
  tracing: disable seeking for trace_pipe_raw
2009-04-13 11:31:28 -07:00
Linus Torvalds
bf20753c0c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  lockdep: continue lock debugging despite some taints
  lockdep: warn about lockdep disabling after kernel taint
2009-04-13 11:30:26 -07:00
Linus Torvalds
d811f236d9 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  percpu: unbreak alpha percpu
  mutex: have non-spinning mutexes on s390 by default
2009-04-13 08:22:43 -07:00
Frederic Weisbecker
574bbe7820 lockdep: continue lock debugging despite some taints
Impact: broaden lockdep checks

Lockdep is disabled after any kernel taints. This might be convenient
to ignore bad locking issues which sources come from outside the kernel
tree. Nevertheless, it might be a frustrating experience for the
staging developers or those who experience a warning but are focused
on another things that require lockdep.

The v2 of this patch simply don't disable anymore lockdep in case
of TAINT_CRAP and TAINT_WARN events.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: LTP <ltp-list@lists.sourceforge.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg KH <gregkh@suse.de>
LKML-Reference: <1239412638-6739-2-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-12 16:10:52 +02:00
Frederic Weisbecker
9eeba6138c lockdep: warn about lockdep disabling after kernel taint
Impact: provide useful missing info for developers

Kernel taint can occur in several situations such as warnings,
load of prorietary or staging modules, bad page, etc...

But when such taint happens, a developer might still be working on
the kernel, expecting that lockdep is still enabled. But a taint
disables lockdep without ever warning about it.
Such a kernel behaviour doesn't really help for kernel development.

This patch adds this missing warning.

Since the taint is done most of the time after the main message that
explain the real source issue, it seems safe to warn about it inside
add_taint() so that it appears at last, without hurting the main
information.

v2: Use a generic helper to disable lockdep instead of an
    open coded xchg().

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1239412638-6739-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-12 16:10:51 +02:00
Li Zefan
66de7792c0 blktrace: fix output of BLK_TC_PC events
BLK_TC_PC events should be treated differently with BLK_TC_FS events.

Before this patch:

 # echo 1 > /sys/block/sda/sda1/trace/enable
 # echo pc > /sys/block/sda/sda1/trace/act_mask
 # echo blk > /debugfs/tracing/current_tracer
 # (generate some BLK_TC_PC events)
 # cat trace
        bash-2184  [000]  1774.275413:   8,7    I   N [bash]
        bash-2184  [000]  1774.275435:   8,7    D   N [bash]
        bash-2184  [000]  1774.275540:   8,7    I   R [bash]
        bash-2184  [000]  1774.275547:   8,7    D   R [bash]
 ksoftirqd/0-4     [000]  1774.275580:   8,7    C   N 0 [0]
        bash-2184  [000]  1774.275648:   8,7    I   R [bash]
        bash-2184  [000]  1774.275653:   8,7    D   R [bash]
 ksoftirqd/0-4     [000]  1774.275682:   8,7    C   N 0 [0]
        bash-2184  [000]  1774.275739:   8,7    I   R [bash]
        bash-2184  [000]  1774.275744:   8,7    D   R [bash]
 ksoftirqd/0-4     [000]  1774.275771:   8,7    C   N 0 [0]
        bash-2184  [000]  1774.275804:   8,7    I   R [bash]
        bash-2184  [000]  1774.275808:   8,7    D   R [bash]
 ksoftirqd/0-4     [000]  1774.275836:   8,7    C   N 0 [0]

After this patch:

 # cat trace
        bash-2263  [000]   366.782149:   8,7    I   N 0 (00 ..) [bash]
        bash-2263  [000]   366.782323:   8,7    D   N 0 (00 ..) [bash]
        bash-2263  [000]   366.782557:   8,7    I   R 8 (25 00 ..) [bash]
        bash-2263  [000]   366.782560:   8,7    D   R 8 (25 00 ..) [bash]
 ksoftirqd/0-4     [000]   366.782582:   8,7    C   N (25 00 ..) [0]
        bash-2263  [000]   366.782648:   8,7    I   R 8 (5a 00 3f 00) [bash]
        bash-2263  [000]   366.782650:   8,7    D   R 8 (5a 00 3f 00) [bash]
 ksoftirqd/0-4     [000]   366.782669:   8,7    C   N (5a 00 3f 00) [0]
        bash-2263  [000]   366.782710:   8,7    I   R 8 (5a 00 08 00) [bash]
        bash-2263  [000]   366.782713:   8,7    D   R 8 (5a 00 08 00) [bash]
 ksoftirqd/0-4     [000]   366.782730:   8,7    C   N (5a 00 08 00) [0]
        bash-2263  [000]   366.783375:   8,7    I   R 36 (5a 00 08 00) [bash]
        bash-2263  [000]   366.783379:   8,7    D   R 36 (5a 00 08 00) [bash]
 ksoftirqd/0-4     [000]   366.783404:   8,7    C   N (5a 00 08 00) [0]

This is what we do with PC events in user-space blktrace.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <49D32387.9040106@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-12 15:32:46 +02:00
Li Zefan
b78825d608 blktrace: fix output of unknown events
Not all events are pc (packet command) events. An event is a pc
event only if it has BLK_TC_PC bit set.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <49D3236D.3090705@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-12 15:32:45 +02:00
Zhaolei
02af61bb50 tracing, kmemtrace: Separate include/trace/kmemtrace.h to kmemtrace part and tracepoint part
Impact: refactor code for future changes

Current kmemtrace.h is used both as header file of kmemtrace and kmem's
tracepoints definition.

Tracepoints' definition file may be used by other code, and should only have
definition of tracepoint.

We can separate include/trace/kmemtrace.h into 2 files:

  include/linux/kmemtrace.h: header file for kmemtrace
  include/trace/kmem.h:      definition of kmem tracepoints

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <49DEE68A.5040902@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-12 15:22:55 +02:00
Li Zefan
44e9c8b7ad tracing/filters: return proper error code when writing filter file
- propagate return value of filter_add_pred() to the user

- return -ENOSPC but not -ENOMEM or -EINVAL when the filter array
  is full

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49E04CF0.3010105@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-12 11:59:29 +02:00
Li Zefan
a3e0ab0507 tracing/filters: allow user input integer to be oct or hex
Before patch:

 # echo 'parent_pid == 0x10' > events/sched/sched_process_fork/filter
 # cat sched/sched_process_fork/filter
 parent_pid == 0

After patch:

 # cat sched/sched_process_fork/filter
 parent_pid == 16

Also check the input more strictly.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49E04C53.4010600@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-12 11:59:28 +02:00
Li Zefan
bcabd91c27 tracing/filters: fix NULL pointer dereference
Try this, and you'll see NULL pointer dereference bug:

  # echo -n 'parent_comm ==' > sched/sched_process_fork/filter

Because we passed NULL ptr to simple_strtoull().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49E04C43.1050504@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-12 11:59:28 +02:00
Li Zefan
8433a40eb7 tracing/filters: NIL-terminate user input filter
Make sure messages from user space are NIL-terminated strings,
otherwise we could dump random memory while reading filter file.

Try this:
 # echo 'parent_comm ==' > events/sched/sched_process_fork/filter
 # cat events/sched/sched_process_fork/filter
 parent_comm == �

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49E04C32.6060508@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-12 11:59:27 +02:00
Linus Torvalds
d6de2c80e9 async: Fix module loading async-work regression
Several drivers use asynchronous work to do device discovery, and we
synchronize with them in the compiled-in case before we actually try to
mount root filesystems etc.

However, when compiled as modules, that synchronization is missing - the
module loading completes, but the driver hasn't actually finished
probing for devices, and that means that any user mode that expects to
use the devices after the 'insmod' is now potentially broken.

We already saw one case of a similar issue in the ACPI battery code,
where the kernel itself expected the module to be all done, and unmapped
the init memory - but the async device discovery was still running.
That got hacked around by just removing the "__init" (see commit
5d38258ec0 "ACPI battery: fix async boot
oops"), but the real fix is to just make the module loading wait for all
async work to be completed.

It will slow down module loading, but since common devices should be
built in anyway, and since the bug is really annoying and hard to handle
from user space (and caused several S3 resume regressions), the simple
fix to wait is the right one.

This fixes at least

	http://bugzilla.kernel.org/show_bug.cgi?id=13063

but probably a few other bugzilla entries too (12936, for example), and
is confirmed to fix Rafael's storage driver breakage after resume bug
report (no bugzilla entry).

We should also be able to now revert that ACPI battery fix.

Reported-and-tested-by: Rafael J. Wysocki <rjw@suse.com>
Tested-by: Heinz Diehl <htd@fancy-poultry.org>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-11 12:44:49 -07:00
Darren Hart
a5a2a0c7fa futex: fix futex_wait_setup key handling
If the get_futex_key() call were to fail, the existing code would
try and put_futex_key() prior to returning.  This patch makes sure
we only put_futex_key() if get_futex_key() succeeded.

Reported-by: Clark Williams <williams@redhat.com>
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
LKML-Reference: <20090410165005.14342.16973.stgit@Aeon>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-10 22:04:24 +02:00
Zhaolei
0462b5664b ftrace: Output REC->var instead of __entry->var for trace format
print fmt: "irq=%d return=%s", __entry->irq, __entry->ret ? \"handled\" : \"unhandled\"

"__entry" should be convert to "REC" by __stringify() macro.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49DC679D.2090901@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-10 15:48:53 +02:00
Li Zefan
4d1f4372db tracing: fix document references
When moving documents to Documentation/trace/, I forgot to
grep Kconfig to find out those references.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Pekka Paalanen <pq@iki.fi>
Cc: eduard.munteanu@linux360.ro
LKML-Reference: <49DE97EF.7080208@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-10 13:08:50 +02:00
Frederic Weisbecker
2062501ae6 tracing/lockdep: report the time waited for a lock
While trying to optimize the new lock on reiserfs to replace
the bkl, I find the lock tracing very useful though it lacks
something important for performance (and latency) instrumentation:
the time a task waits for a lock.

That's what this patch implements:

  bash-4816  [000]   202.652815: lock_contended: lock_contended: &sb->s_type->i_mutex_key
  bash-4816  [000]   202.652819: lock_acquired: &rq->lock (0.000 us)
 <...>-4787  [000]   202.652825: lock_acquired: &rq->lock (0.000 us)
 <...>-4787  [000]   202.652829: lock_acquired: &rq->lock (0.000 us)
  bash-4816  [000]   202.652833: lock_acquired: &sb->s_type->i_mutex_key (16.005 us)

As shown above, the "lock acquired" field is followed by the time
it has been waiting for the lock. Usually, a lock contended entry
is followed by a near lock_acquired entry with a non-zero time waited.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1238975373-15739-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-10 12:50:56 +02:00
Ingo Molnar
1cad1252ed Merge branch 'tracing/urgent' into tracing/core
Merge reason: pick up both v2.6.30-rc1 [which includes tracing/urgent fixes]
              and pick up the current lineup of tracing/urgent fixes as well

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-10 12:46:51 +02:00
Lai Jiangshan
93cfb3c9fd tracing: fix splice return too large
I got these from strace:

 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 12288
 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 12288
 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 12288
 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 16384
 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 8192
 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 8192
 splice(0x3, 0, 0x5, 0, 0x1000, 0x1) = 8192

I wanted to splice_read 4096 bytes, but it returns 8192 or larger.

It is because the return value of tracing_buffers_splice_read()
does not include "zero out any left over data" bytes.

But tracing_buffers_read() includes these bytes, we make them
consistent.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <49D46674.9030804@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-10 12:44:46 +02:00
Lai Jiangshan
c7625a555f tracing: update file->f_pos when splice(2) it
Impact: Cleanup

These two lines:

	if (unlikely(*ppos))
		return -ESPIPE;

in tracing_buffers_splice_read() are not needed, VFS layer
has disabled seek(2).

We remove these two lines, and then we can update file->f_pos.

And tracing_buffers_read() updates file->f_pos, this fix
make tracing_buffers_splice_read() updates file->f_pos too.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <49D46670.4010503@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-10 12:44:44 +02:00
Lai Jiangshan
ddd538f3e6 tracing: allocate page when needed
Impact: Cleanup

Sometimes, we open trace_pipe_raw, but we don't read(2) it,
we just splice(2) it, thus, the page is not used.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <49D4666B.4010608@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-10 12:44:43 +02:00
Lai Jiangshan
d1e7e02f30 tracing: disable seeking for trace_pipe_raw
Impact: disable pread()

We set tracing_buffers_fops.llseek to no_llseek,
but we can still perform pread() to read this file.

That is not expected.

This fix uses nonseekable_open() to disable it.

tracing_buffers_fops.llseek is still set to no_llseek,
it mark this file is a "non-seekable device" and is used by
sys_splice(). See also do_splice() or manual of splice(2):

ERRORS
       EINVAL Target file system doesn't support  splicing;
              neither  of the descriptors refers to a pipe;
              or offset given for non-seekable device.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <49D46668.8030806@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-10 12:44:42 +02:00
Linus Torvalds
c2ea122cd7 Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing: consolidate documents
  blktrace: pass the right pointer to kfree()
  tracing/syscalls: use a dedicated file header
  tracing: append a comma to INIT_FTRACE_GRAPH
2009-04-09 10:37:46 -07:00
Linus Torvalds
17b2e9bf27 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: do not count frozen tasks toward load
  sched: refresh MAINTAINERS entry
  sched: Print sched_group::__cpu_power in sched_domain_debug
  cpuacct: add per-cgroup utime/stime statistics
  posixtimers, sched: Fix posix clock monotonicity
  sched_rt: don't allocate cpumask in fastpath
  cpuacct: make cpuacct hierarchy walk in cpuacct_charge() safe when rcupreempt is used -v2
2009-04-09 10:37:28 -07:00
Linus Torvalds
422a253483 Merge branches 'core-fixes-for-linus', 'irq-fixes-for-linus' and 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  printk: fix wrong format string iter for printk
  futex: comment requeue key reference semantics

* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  irq: fix cpumask memory leak on offstack cpumask kernels

* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  posix-timers: fix RLIMIT_CPU && setitimer(CPUCLOCK_PROF)
  posix-timers: fix RLIMIT_CPU && fork()
  timers: add missing kernel-doc
2009-04-09 10:35:30 -07:00
Heiko Carstens
36cd3c9f92 mutex: have non-spinning mutexes on s390 by default
Impact: performance regression fix for s390

The adaptive spinning mutexes will not always do what one would expect on
virtualized architectures like s390. Especially the cpu_relax() loop in
mutex_spin_on_owner might hurt if the mutex holding cpu has been scheduled
away by the hypervisor.

We would end up in a cpu_relax() loop when there is no chance that the
state of the mutex changes until the target cpu has been scheduled again by
the hypervisor.

For that reason we should change the default behaviour to no-spin on s390.

We do have an instruction which allows to yield the current cpu in favour of
a different target cpu. Also we have an instruction which allows us to figure
out if the target cpu is physically backed.

However we need to do some performance tests until we can come up with
a solution that will do the right thing on s390.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
LKML-Reference: <20090409184834.7a0df7b2@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-09 19:28:24 +02:00
Peter Zijlstra
d3d21c412d perf_counter: log full path names
Impact: fix perf-report output for /home mounted binaries, etc.

dentry_path() only provide path-names up to the mount root, which is
unsuited for out purpose, use d_path() instead.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090409085524.601794134@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-09 11:50:54 +02:00
Peter Zijlstra
1ccd154978 perf_counter: sysctl for system wide perf counters
Impact: add sysctl for paranoid/relaxed perfcounters policy

Allow the use of system wide perf counters to everybody, but provide
a sysctl to disable it for the paranoid security minded.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090409085524.514046352@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-09 11:50:52 +02:00
Peter Zijlstra
9ee318a782 perf_counter: optimize mmap/comm tracking
Impact: performance optimization

The mmap/comm tracking code does quite a lot of work before it discovers
there's no interest in it, avoid that by keeping a counter.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090409085524.427173196@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-09 11:50:43 +02:00
Ingo Molnar
888fcee066 perf_counter: fix off task->comm by one
strlen() does not include the \0.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-09 09:48:22 +02:00
Paul Turner
002f128b47 sched: remove redundant hierarchy walk in check_preempt_wakeup
Impact: micro-optimization

Under group scheduling we traverse up until we are at common siblings
to make the wakeup comparison on.

At this point however, they should have the same parent so continuing
to check up the tree is redundant.

Signed-off-by: Paul Turner <pjt@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <alpine.DEB.1.00.0904081520320.30317@kitami.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-09 08:19:08 +02:00
Li Zefan
9eb85125ce blktrace: pass the right pointer to kfree()
Impact: fix kfree crash with non-standard act_mask string

If passing a string with leading white spaces to strstrip(),
the returned ptr != the original ptr.

This bug was introduced by me.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <49DD694C.8020902@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-09 05:52:40 +02:00
Frederic Weisbecker
47788c58e6 tracing/syscalls: use a dedicated file header
Impact: fix build warnings and possibe compat misbehavior on IA64

Building a kernel on ia64 might trigger these ugly build warnings:

CC      arch/ia64/ia32/sys_ia32.o
In file included from arch/ia64/ia32/sys_ia32.c:55:
arch/ia64/ia32/ia32priv.h:290:1: warning: "elf_check_arch" redefined
In file included from include/linux/elf.h:7,
                 from include/linux/module.h:14,
                 from include/linux/ftrace.h:8,
                 from include/linux/syscalls.h:68,
                 from arch/ia64/ia32/sys_ia32.c:18:
arch/ia64/include/asm/elf.h:19:1: warning: this is the location of the previous definition
[...]

sys_ia32.c includes linux/syscalls.h which in turn includes linux/ftrace.h
to import the syscalls tracing prototypes.

But including ftrace.h can pull too much things for a low level file,
especially on ia64 where the ia32 private headers conflict with higher
level headers.

Now we isolate the syscall tracing headers in their own lightweight file.

Reported-by: Tony Luck <tony.luck@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jason Baron <jbaron@redhat.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Davidson <md@google.com>
LKML-Reference: <20090408184058.GB6017@nowhere>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-09 05:43:32 +02:00
Andrew Morton
6b44003e5c work_on_cpu(): rewrite it to create a kernel thread on demand
Impact: circular locking bugfix

The various implemetnations and proposed implemetnations of work_on_cpu()
are vulnerable to various deadlocks because they all used queues of some
form.

Unrelated pieces of kernel code thus gained dependencies wherein if one
work_on_cpu() caller holds a lock which some other work_on_cpu() callback
also takes, the kernel could rarely deadlock.

Fix this by creating a short-lived kernel thread for each work_on_cpu()
invokation.

This is not terribly fast, but the only current caller of work_on_cpu() is
pci_call_probe().

It would be nice to find some other way of doing the node-local
allocations in the PCI probe code so that we can zap work_on_cpu()
altogether.  The code there is rather nasty.  I can't think of anything
simple at this time...

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-04-09 09:50:37 +09:30
Oleg Nesterov
1c99315bb3 kthread: move sched-realeted initialization from kthreadd context
kthreadd is the single thread which implements ths "create" request, move
sched_setscheduler/etc from create_kthread() to kthread_create() to
improve the scalability.

We should be careful with sched_setscheduler(), use _nochek helper.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Vitaliy Gusev <vgusev@openvz.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-04-09 09:50:37 +09:30
Vitaliy Gusev
3217ab97f1 kthread: Don't looking for a task in create_kthread() #2
Remove the unnecessary find_task_by_pid_ns(). kthread() can just
use "current" to get the same result.

Signed-off-by: Vitaliy Gusev <vgusev@openvz.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-04-09 09:50:36 +09:30
Roland McGrath
3a70970353 ptrace: some checkpatch fixes
This fixes all the checkpatch --file complaints about kernel/ptrace.c
and also removes an unused #include.  I've verified that there are no
changes to the compiled code on x86_64.

Signed-off-by: Roland McGrath <roland@redhat.com>
[ Removed the parts that just split a line  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-08 10:21:44 -07:00
Peter Zijlstra
78f13e9525 perf_counter: allow for data addresses to be recorded
Paul suggested we allow for data addresses to be recorded along with
the traditional IPs as power can provide these.

For now, only the software pagefault events provide data addresses,
but in the future power might as well for some events.

x86 doesn't seem capable of providing this atm.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090408130409.394816925@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 19:05:56 +02:00
Peter Zijlstra
4d855457d8 perf_counter: move PERF_RECORD_TIME
Move PERF_RECORD_TIME so that all the fixed length items come before
the variable length ones.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090408130409.307926436@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 19:05:55 +02:00
Peter Zijlstra
8d1b2d9361 perf_counter: track task-comm data
Similar to the mmap data stream, add one that tracks the task COMM field,
so that the userspace reporting knows what to call a task.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090408130409.127422406@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 19:05:47 +02:00
Peter Zijlstra
6b6e5486b3 perf_counter: use misc field to widen type
Push the PERF_EVENT_COUNTER_OVERFLOW bit into the misc field so that
we can have the full 32bit for PERF_RECORD_ bits.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090408130408.891867663@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 18:53:28 +02:00
Peter Zijlstra
6fab01927e perf_counter: provide misc bits in the event header
Limit the size of each record to 64k (or should we count in multiples
of u64 and have a 512K limit?), this gives 16 bits or spare room in the
header, which we can use for misc bits, so as to not have to grow the
record with u64 every time we have a few bits to report.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090408130408.769271806@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 18:53:27 +02:00
Peter Zijlstra
e30e08f65c perf_counter: fix NMI race in task clock
We should not be updating ctx->time from NMI context, work around that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090408130408.681326666@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 18:53:27 +02:00
Oleg Nesterov
8f2e586567 posix-timers: fix RLIMIT_CPU && setitimer(CPUCLOCK_PROF)
update_rlimit_cpu() tries to optimize out set_process_cpu_timer() in case
when we already have CPUCLOCK_PROF timer which should expire first. But it
uses cputime_lt() instead of cputime_gt().

Test case:

	int main(void)
	{
		struct itimerval it = {
			.it_value = { .tv_sec = 1000 },
		};

		assert(!setitimer(ITIMER_PROF, &it, NULL));

		struct rlimit rl = {
			.rlim_cur = 1,
			.rlim_max = 1,
		};

		assert(!setrlimit(RLIMIT_CPU, &rl));

		for (;;)
			;

		return 0;
	}

Without this patch, the task is not killed as RLIMIT_CPU demands.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Peter Lojkin <ia6432@inbox.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: stable@kernel.org
LKML-Reference: <20090327000610.GA10108@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 17:51:39 +02:00
Oleg Nesterov
6279a751fe posix-timers: fix RLIMIT_CPU && fork()
See http://bugzilla.kernel.org/show_bug.cgi?id=12911

copy_signal() copies signal->rlim, but RLIMIT_CPU is "lost". Because
posix_cpu_timers_init_group() sets cputime_expires.prof_exp = 0 and thus
fastpath_timer_check() returns false unless we have other expired cpu timers.

Change copy_signal() to set cputime_expires.prof_exp if we have RLIMIT_CPU.
Also, set cputimer.running = 1 in that case. This is not strictly necessary,
but imho makes sense.

Reported-by: Peter Lojkin <ia6432@inbox.ru>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Peter Lojkin <ia6432@inbox.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: stable@kernel.org
LKML-Reference: <20090327000607.GA10104@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 17:51:38 +02:00
Ingo Molnar
5af8c4e0fa Merge commit 'v2.6.30-rc1' into sched/urgent
Merge reason: update to latest upstream to queue up fix

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 17:26:00 +02:00
Ingo Molnar
ff96e612cb Merge commit 'v2.6.30-rc1' into core/urgent
Merge reason: need latest upstream to queue up dependent fix

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 17:02:57 +02:00
Darren Hart
bab5bc9e85 futex: fixup unlocked requeue pi case
Thomas's testing caught a problem when the requeue target futex is
unowned and multiple tasks are requeued to it. This patch ensures
the FUTEX_WAITERS bit gets set if futex_requeue() will requeue one
or more tasks in addition to the one acquiring the lock.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-08 12:58:20 +02:00
Ingo Molnar
5ea472a77f Merge commit 'v2.6.30-rc1' into perfcounters/core
Conflicts:
	arch/powerpc/include/asm/systbl.h
	arch/powerpc/include/asm/unistd.h
	include/linux/init_task.h

Merge reason: the conflicts are non-trivial: PowerPC placement
              of sys_perf_counter_open has to be mixed with the
	      new preadv/pwrite syscalls.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-08 10:35:30 +02:00
Linus Torvalds
1551260d1f Merge branch 'core/softlockup' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core/softlockup' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  softlockup: make DETECT_HUNG_TASK default depend on DETECT_SOFTLOCKUP
  softlockup: move 'one' to the softlockup section in sysctl.c
  softlockup: ensure the task has been switched out once
  softlockup: remove timestamp checking from hung_task
  softlockup: convert read_lock in hung_task to rcu_read_lock
  softlockup: check all tasks in hung_task
  softlockup: remove unused definition for spawn_softlockup_task
  softlockup: fix potential race in hung_task when resetting timeout
  softlockup: fix to allow compiling with !DETECT_HUNG_TASK
  softlockup: decouple hung tasks check from softlockup detection
2009-04-07 14:11:07 -07:00
Linus Torvalds
c93f216b5b Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  branch tracer, intel-iommu: fix build with CONFIG_BRANCH_TRACER=y
  branch tracer: Fix for enabling branch profiling makes sparse unusable
  ftrace: Correct a text align for event format output
  Update /debug/tracing/README
  tracing/ftrace: alloc the started cpumask for the trace file
  tracing, x86: remove duplicated #include
  ftrace: Add check of sched_stopped for probe_sched_wakeup
  function-graph: add proper initialization for init task
  tracing/ftrace: fix missing include string.h
  tracing: fix incorrect return type of ns2usecs()
  tracing: remove CALLER_ADDR2 from wakeup tracer
  blktrace: fix pdu_len when tracing packet command requests
  blktrace: small cleanup in blk_msg_write()
  blktrace: NUL-terminate user space messages
  tracing: move scripts/trace/power.pl to scripts/tracing/power.pl
2009-04-07 14:10:10 -07:00
Linus Torvalds
c61b79b6ef Merge branch 'irq/threaded' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq/threaded' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: fix devres.o build for GENERIC_HARDIRQS=n
  genirq: provide old request_irq() for CONFIG_GENERIC_HARDIRQ=n
  genirq: threaded irq handlers review fixups
  genirq: add support for threaded interrupts to devres
  genirq: add threaded interrupt handler support
2009-04-07 14:07:52 -07:00
Jeremy Fitzhardinge
38f4b8c0da Merge commit 'origin/master' into for-linus/xen/master
* commit 'origin/master': (4825 commits)
  Fix build errors due to CONFIG_BRANCH_TRACER=y
  parport: Use the PCI IRQ if offered
  tty: jsm cleanups
  Adjust path to gpio headers
  KGDB_SERIAL_CONSOLE check for module
  Change KCONFIG name
  tty: Blackin CTS/RTS
  Change hardware flow control from poll to interrupt driven
  Add support for the MAX3100 SPI UART.
  lanana: assign a device name and numbering for MAX3100
  serqt: initial clean up pass for tty side
  tty: Use the generic RS485 ioctl on CRIS
  tty: Correct inline types for tty_driver_kref_get()
  splice: fix deadlock in splicing to file
  nilfs2: support nanosecond timestamp
  nilfs2: introduce secondary super block
  nilfs2: simplify handling of active state of segments
  nilfs2: mark minor flag for checkpoint created by internal operation
  nilfs2: clean up sketch file
  nilfs2: super block operations fix endian bug
  ...

Conflicts:
	arch/x86/include/asm/thread_info.h
	arch/x86/lguest/boot.c
	drivers/xen/manage.c
2009-04-07 13:34:16 -07:00
Masami Hiramatsu
de5bd88d5a kprobes: support per-kprobe disabling
Add disable_kprobe() and enable_kprobe() to disable/enable kprobes
temporarily.

disable_kprobe() asynchronously disables probe handlers of specified
kprobe.  So, after calling it, some handlers can be called at a while.
enable_kprobe() enables specified kprobe.

aggr_pre_handler and aggr_post_handler check disabled probes.  On the
other hand aggr_break_handler and aggr_fault_handler don't check it
because these handlers will be called while executing pre or post handlers
and usually those help error handling.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:08 -07:00
Masami Hiramatsu
e579abeb58 kprobes: rename kprobe_enabled to kprobes_all_disarmed
Rename kprobe_enabled to kprobes_all_disarmed and invert logic due to
avoiding naming confusion from per-probe disabling.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:08 -07:00
Masami Hiramatsu
99081ab553 kprobes: move EXPORT_SYMBOL_GPL just after function definitions
Clean up positions of EXPORT_SYMBOL_GPL in kernel/kprobes.c according to
checkpatch.pl.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:08 -07:00
Masami Hiramatsu
b918e5e60d kprobes: cleanup aggr_kprobe related code
Currently, kprobes can disable all probes at once, but can't disable it
individually (not unregister, just disable an kprobe, because
unregistering needs to wait for scheduler synchronization).  These patches
introduce APIs for on-the-fly per-probe disabling and re-enabling by
dis-arming/re-arming its breakpoint instruction.

This patch:

Change old_p to ap in add_new_kprobe() for readability, copy flags member
in add_aggr_kprobe(), and simplify the code flow of
register_aggr_kprobe().

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:07 -07:00
Peter W Morreale
fafd688e4c mm: add /proc controls for pdflush threads
Add /proc entries to give the admin the ability to control the minimum and
maximum number of pdflush threads.  This allows finer control of pdflush
on both large and small machines.

The rationale is simply one size does not fit all.  Admins on large and/or
small systems may want to tune the min/max pdflush thread count to best
suit their needs.  Right now the min/max is hardcoded to 2/8.  While
probably a fair estimate for smaller machines, large machines with large
numbers of CPUs and large numbers of filesystems/block devices may benefit
from larger numbers of threads working on different block devices.

Even if the background flushing algorithm is radically changed, it is
still likely that multiple threads will be involved and admins would still
desire finer control on the min/max other than to have to recompile the
kernel.

The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
'/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.

The minimum value for nr_pdflush_threads_min is 1 and the maximum value is
the current value of nr_pdflush_threads_max.  This minimum is required
since additional thread creation is performed in a pdflush thread itself.

The minimum value for nr_pdflush_threads_max is the current value of
nr_pdflush_threads_min and the maximum value can be 1000.

Documentation/sysctl/vm.txt is also updated.

[akpm@linux-foundation.org: fix comment, fix whitespace, use __read_mostly]
Signed-off-by: Peter W Morreale <pmorreale@novell.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:03 -07:00
Zhaolei
dcef788eb9 ftrace: clean up enable logic for sched_switch
Unify sched_switch and sched_wakeup's action to following logic:
Do record_cmdline when start_cmdline_record() is called.
Start tracing events when the tracer is started.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
LKML-Reference: <49D1C596.5050203@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-07 14:43:09 +02:00
Steven Rostedt
597af81537 function-graph: use int instead of atomic for ftrace_graph_active
Impact: cleanup

The variable ftrace_graph_active is only modified under the
ftrace_lock mutex, thus an atomic is not necessary for modification.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-07 14:43:08 +02:00
Frederic Weisbecker
5452af664f tracing/ftrace: factorize the tracing files creation
Impact: cleanup

Most of the tracing files creation follow the same pattern:

ret = debugfs_create_file(...)
if (!ret)
	pr_warning("Couldn't create ... entry\n")

Unify it!

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1238109938-11840-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-07 14:43:07 +02:00
Li Zefan
a5dec5573f tracing: use macros to denote usec and nsec per second
Impact: cleanup

Use USEC_PER_SEC and NSEC_PER_SEC instead of 1000000 and 1000000000.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <49CC7870.9000309@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-07 14:43:06 +02:00
Ingo Molnar
86665c75da Merge branch 'tracing/urgent' into tracing/ftrace 2009-04-07 14:41:17 +02:00
Zhaolei
1bbe2a83ab ftrace: Correct a text align for event format output
If we cat debugfs/tracing/events/ftrace/bprint/format, we'll see:
name: bprint
ID: 6
format:
	field:unsigned char common_type;	offset:0;	size:1;
	field:unsigned char common_flags;	offset:1;	size:1;
	field:unsigned char common_preempt_count;	offset:2;	size:1;
	field:int common_pid;	offset:4;	size:4;
	field:int common_tgid;	offset:8;	size:4;

	field:unsigned long ip;	offset:12;	size:4;
	field:char * fmt;	offset:16;	size:4;
	field: char buf;	offset:20;	size:0;

print fmt: "%08lx (%d) fmt:%p %s"

There is an inconsistent blank before char buf.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
LKML-Reference: <49D5E3EE.70201@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 14:02:42 +02:00
Nikanth Karthikesan
bc2b6871c1 Update /debug/tracing/README
Some of the tracers have been renamed, which was not updated in the in-kernel
run-time README file. Update it.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
LKML-Reference: <200903231158.32151.knikanth@suse.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 14:02:36 +02:00
Frederic Weisbecker
b0dfa978c7 tracing/ftrace: alloc the started cpumask for the trace file
Impact: fix a crash while cat trace file

Currently we are using a cpumask to remind each cpu where a
trace occured. It lets us notice the user that a cpu just had
its first trace.

But on latest -tip we have the following crash once we cat the trace
file:

IP: [<c0270c4a>] print_trace_fmt+0x45/0xe7
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/class/net/eth0/carrier
Pid: 3897, comm: cat Not tainted (2.6.29-tip-02825-g0f22972-dirty #81)
EIP: 0060:[<c0270c4a>] EFLAGS: 00010297 CPU: 0
EIP is at print_trace_fmt+0x45/0xe7
EAX: 00000000 EBX: 00000000 ECX: c12d9e98 EDX: ccdb7010
ESI: d31f4000 EDI: 00322401 EBP: d31f3f10 ESP: d31f3efc
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process cat (pid: 3897, ti=d31f2000 task=d3b3cf20 task.ti=d31f2000)
Stack:
d31f4080 ccdb7010 d31f4000 d691fe70 ccdb7010 d31f3f24 c0270e5c d31f4000
d691fe70 d31f4000 d31f3f34 c02718e8 c12d9e98 d691fe70 d31f3f70 c02bfc33
00001000 09130000 d3b46e00 d691fe98 00000000 00000079 00000001 00000000
Call Trace:
[<c0270e5c>] ? print_trace_line+0x170/0x17c
[<c02718e8>] ? s_show+0xa7/0xbd
[<c02bfc33>] ? seq_read+0x24a/0x327
[<c02bf9e9>] ? seq_read+0x0/0x327
[<c02ab18b>] ? vfs_read+0x86/0xe1
[<c02ab289>] ? sys_read+0x40/0x65
[<c0202d8f>] ? sysenter_do_call+0x12/0x3c
Code: 00 00 00 89 45 ec f7 c7 00 20 00 00 89 55 f0 74 4e f6 86 98 10 00 00 02 74 45 8b 86 8c 10 00 00 8b 9e a8 10 00 00 e8 52 f3 ff ff <0f> a3 03 19 c0 85 c0 75 2b 8b 86 8c 10 00 00 8b 9e a8 10 00 00
EIP: [<c0270c4a>] print_trace_fmt+0x45/0xe7 SS:ESP 0068:d31f3efc
CR2: 0000000000000000
---[ end trace aa9cf38e5ebed9dd ]---

This is because we alloc the iter->started cpumask on tracing_pipe_open but
not on tracing_open.

It hadn't been noticed until now because we need to have ring buffer overruns
to activate the starting of cpu buffer detection.

Also, we need a check to not print the messagge for the first trace on the file.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1238619188-6109-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 14:02:03 +02:00
Zhaolei
8bcae09b93 ftrace: Add check of sched_stopped for probe_sched_wakeup
The wakeup tracing in sched_switch does not stop when a user
disables tracing. This is because the probe_sched_wakeup() is missing
the check to prevent the wakeup from being traced.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
LKML-Reference: <49D1C543.3010307@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 14:01:11 +02:00
Frederic Weisbecker
5f0c6c03c5 tracing/ftrace: fix missing include string.h
Building a kernel with tracing can raise the following warning on
tip/master:

kernel/trace/trace.c:1249: error: implicit declaration of function 'vbin_printf'

We are missing an include to string.h

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1238160130-7437-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 14:00:18 +02:00
Lai Jiangshan
cf8e347465 tracing: fix incorrect return type of ns2usecs()
Impact: fix time output bug in 32bits system

ns2usecs() returns 'long', it's incorrect.

(In i386)
...
          <idle>-0     [000]   521.442100: _spin_lock <-tick_do_update_jiffies64
          <idle>-0     [000]   521.442101: do_timer <-tick_do_update_jiffies64
          <idle>-0     [000]   521.442102: update_wall_time <-do_timer
          <idle>-0     [000]   521.442102: update_xtime_cache <-update_wall_time
....
(It always print the time less than 2200 seconds besides ...)
Because 'long' is 32bits in i386. ( (1<<31) useconds is about 2200 seconds)

...
          <idle>-0     [001] 4154502640.134759: rcu_bh_qsctr_inc <-__do_softirq
          <idle>-0     [001] 4154502640.134760: _local_bh_enable <-__do_softirq
          <idle>-0     [001] 4154502640.134761: idle_cpu <-irq_exit
...
(very large value)
Because 'long' is a signed type and it is 32bits in i386.

Changes in v2:
return 'unsigned long long' instead of 'cycle_t'

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <49D05D10.4030009@cn.fujitsu.com>
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 13:59:23 +02:00
Steven Rostedt
301fd748e2 tracing: remove CALLER_ADDR2 from wakeup tracer
Maneesh Soni was getting a crash when running the wakeup tracer.
We debugged it down to the recording of the function with the
CALLER_ADDR2 macro.  This is used to get the location of the caller
to schedule.

But the problem comes when schedule is called by assmebly. In the case
that Maneesh had, retint_careful would call schedule. But retint_careful
does not set up a proper frame pointer. CALLER_ADDR2 is defined as
__builtin_return_address(2). This produces the following assembly in
the wakeup tracer code.

   mov    0x0(%rbp),%rcx  <--- get the frame pointer of the caller
   mov    %r14d,%r8d
   mov    0xf2de8e(%rip),%rdi

   mov    0x8(%rcx),%rsi  <-- this is __builtin_return_address(1)
   mov    0x28(%rdi,%rax,8),%rbx

   mov    (%rcx),%rax  <-- get the frame pointer of the caller's caller
   mov    %r12,%rcx
   mov    0x8(%rax),%rdx <-- this is __builtin_return_address(2)

At the reading of 0x8(%rax) Maneesh's machine would take a fault.
The reason is that retint_careful did not set up the return address
and the content of %rax here was zero.

To verify this, I sent Maneesh a patch to create a frame pointer
in retint_careful. He ran the test again but this time he would take
the same type of fault from sysret_careful. The retint_careful was no
longer an issue, but there are other callers that still have issues.

Instead of adding frame pointers for all callers to schedule (in possibly
all archs), it is much safer to simply not use CALLER_ADDR2. This
loses out on knowing what called schedule, but the function tracer
will help there if needed.

Reported-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 13:58:54 +02:00
Ingo Molnar
93776a8ec7 Merge branch 'linus' into tracing/core
Merge reason: update to upstream tracing facilities

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 13:47:45 +02:00
Ingo Molnar
a053958f07 Merge branch 'tracing/blktrace-fixes' into tracing/urgent
Merge reason: this used to be a tracing/blktrace-v2 devel topic still
              cooking during the merge window - has propagated to fixes

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 13:40:55 +02:00
Markus Metzger
0f4814065f x86, ptrace: add bts context unconditionally
Add the ptrace bts context field to task_struct unconditionally.

Initialize the field directly in copy_process().
Remove all the unneeded functionality used to initialize that field.

Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Cc: roland@redhat.com
Cc: eranian@googlemail.com
Cc: oleg@redhat.com
Cc: juan.villacis@intel.com
Cc: ak@linux.jf.intel.com
LKML-Reference: <20090403144603.292754000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 13:36:31 +02:00
Markus Metzger
4d657e51df x86, hw-branch-tracer: allocate selftest iterator on heap
Allocate the trace_iterator for the hw-branch-tracer selftest on the heap.

Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Cc: roland@redhat.com
Cc: eranian@googlemail.com
Cc: oleg@redhat.com
Cc: juan.villacis@intel.com
Cc: ak@linux.jf.intel.com
LKML-Reference: <20090403144556.578777000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 13:36:21 +02:00
Markus Metzger
de79f54f53 x86, bts, hw-branch-tracer: add _noirq variants to the debug store interface
The hw-branch-tracer uses debug store functions from an on_each_cpu()
context, which is simply wrong since the functions may sleep.

Add _noirq variants for most functions, which  may be called with
interrupts disabled.

Separate per-cpu and per-task tracing and allow per-cpu tracing to be
controlled from any cpu.

Make the hw-branch-tracer use the new debug store interface, synchronize
with hotplug cpu event using get/put_online_cpus(), and remove the
unnecessary spinlock.

Make the ptrace bts and the ds selftest code use the new interface.

Defer the ds selftest.

Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Cc: roland@redhat.com
Cc: eranian@googlemail.com
Cc: oleg@redhat.com
Cc: juan.villacis@intel.com
Cc: ak@linux.jf.intel.com
LKML-Reference: <20090403144555.658136000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 13:36:20 +02:00
Markus Metzger
a26b89f05d sched, hw-branch-tracer: add wait_task_context_switch() function to sched.h
Add a function to wait until some other task has been
switched out at least once.

This differs from wait_task_inactive() subtly, in that the
latter will wait until the task has left the CPU.

Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Cc: markus.t.metzger@gmail.com
Cc: roland@redhat.com
Cc: eranian@googlemail.com
Cc: oleg@redhat.com
Cc: juan.villacis@intel.com
Cc: ak@linux.jf.intel.com
LKML-Reference: <20090403144549.794157000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 13:36:12 +02:00
Ingo Molnar
2e8844e13a Merge branch 'linus' into tracing/hw-branch-tracing
Merge reason: update to latest tracing and ptrace APIs

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 13:34:42 +02:00
Ingo Molnar
6c009ecef8 Merge branch 'linus' into perfcounters/core
Merge reason: need the upstream facility added by:

  7f1e2ca: hrtimer: fix rq->lock inversion (again)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 12:05:25 +02:00
Ingo Molnar
5e34437840 Merge branch 'linus' into core/softlockup
Conflicts:
	kernel/sysctl.c
2009-04-07 11:15:40 +02:00
Peter Zijlstra
bce379bf35 perf_counter: minimize context time updates
Push the update_context_time() calls up the stack so that we get less
invokations and thereby a less noisy output:

before:

 # ./perfstat -e 1:0 -e 1:1 -e 1:1 -e 1:1 -l ls > /dev/null

 Performance counter stats for 'ls':

      10.163691  cpu clock ticks      (msecs)  (scaled from 98.94%)
      10.215360  task clock ticks     (msecs)  (scaled from 98.18%)
      10.185549  task clock ticks     (msecs)  (scaled from 98.53%)
      10.183581  task clock ticks     (msecs)  (scaled from 98.71%)

 Wall-clock time elapsed:    11.912858 msecs

after:

 # ./perfstat -e 1:0 -e 1:1 -e 1:1 -e 1:1 -l ls > /dev/null

 Performance counter stats for 'ls':

       9.316630  cpu clock ticks      (msecs)
       9.280789  task clock ticks     (msecs)
       9.280789  task clock ticks     (msecs)
       9.280789  task clock ticks     (msecs)

 Wall-clock time elapsed:     9.574872 msecs

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094518.618876874@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:49:01 +02:00
Peter Zijlstra
849691a6cd perf_counter: remove rq->lock usage
Now that all the task runtime clock users are gone, remove the ugly
rq->lock usage from perf counters, which solves the nasty deadlock
seen when a software task clock counter was read from an NMI overflow
context.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094518.531137582@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:49:01 +02:00
Peter Zijlstra
a39d6f2556 perf_counter: rework the task clock software counter
Rework the task clock software counter to use the context time instead
of the task runtime clock, this removes the last such user.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094518.445450972@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:49:00 +02:00
Peter Zijlstra
4af4998b8a perf_counter: rework context time
Since perf_counter_context is switched along with tasks, we can
maintain the context time without using the task runtime clock.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094518.353552838@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:49:00 +02:00
Peter Zijlstra
4c9e25428f perf_counter: change event definition
Currently the definition of an event is slightly ambiguous. We have
wakeup events, for poll() and SIGIO, which are either generated
when a record crosses a page boundary (hw_events.wakeup_events == 0),
or every wakeup_events new records.

Now a record can be either a counter overflow record, or a number of
different things, like the mmap PROT_EXEC region notifications.

Then there is the PERF_COUNTER_IOC_REFRESH event limit, which only
considers counter overflows.

This patch changes then wakeup_events and SIGIO notification to only
consider overflow events. Furthermore it changes the SIGIO notification
to report SIGHUP when the event limit is reached and the counter will
be disabled.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094518.266679874@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:48:59 +02:00
Peter Zijlstra
79f1464156 perf_counter: counter overflow limit
Provide means to auto-disable the counter after 'n' overflow events.

Create the counter with hw_event.disabled = 1, and then issue an
ioctl(fd, PREF_COUNTER_IOC_REFRESH, n); to set the limit and enable
the counter.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094518.083139737@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:48:58 +02:00
Peter Zijlstra
339f7c90b8 perf_counter: PERF_RECORD_TIME
By popular request, provide means to log a timestamp along with the
counter overflow event.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094518.024173282@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:48:57 +02:00
Peter Zijlstra
ebb3c4c4cb perf_counter: fix the mlock accounting
Reading through the code I saw I forgot the finish the mlock accounting.
Do so now.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094517.899767331@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:48:57 +02:00
Peter Zijlstra
f6c7d5fe58 perf_counter: theres more to overflow than writing events
Prepare for more generic overflow handling. The new perf_counter_overflow()
method will handle the generic bits of the counter overflow, and can return
a !0 return value, in which case the counter should be (soft) disabled, so
that it won't count until it's properly disabled.

XXX: do powerpc and swcounter

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094517.812109629@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:48:56 +02:00
Peter Zijlstra
671dec5daf perf_counter: generalize pending infrastructure
Prepare the pending infrastructure to do more than wakeups.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094517.634732847@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:48:55 +02:00
Peter Zijlstra
3c446b3d3b perf_counter: SIGIO support
Provide support for fcntl() I/O availability signals.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094517.579788800@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:48:55 +02:00
Peter Zijlstra
9c03d88e32 perf_counter: add more context information
Change the callchain context entries to u16, so as to gain some space.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090406094517.457320003@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-07 10:48:54 +02:00
Rusty Russell
2e45e77787 Revert "module: remove the SHF_ALLOC flag on the __versions section."
This reverts commit 9cb610d8e3.

This was an impressively stupid patch.  Firstly, we reset the SHF_ALLOC
flag lower down in the same function, so the patch was useless.  Even
better, find_sec() ignores sections with SHF_ALLOC not set, so
it breaks CONFIG_MODVERSIONS=y with CONFIG_MODULE_FORCE_LOAD=n, which
refuses to load the module since it can't find the __versions section.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-04-07 17:12:43 +09:30
Oleg Nesterov
432870dab8 exit_notify: kill the wrong capable(CAP_KILL) check
The CAP_KILL check in exit_notify() looks just wrong, kill it.

Whatever logic we have to reset ->exit_signal, the malicious user
can bypass it if it execs the setuid application before exiting.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 14:57:23 -07:00
Linus Torvalds
cd5f9a4c31 kernel/sysctl.c: avoid annoying warnings
Some of the limit constants are used only depending on some complex
configuration dependencies, yet it's not worth making the simple
variables depend on those configuration details.  Just mark them as
perhaps not being unused, and avoid the warning.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 13:38:46 -07:00
Linus Torvalds
609862be07 Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  lockdep: add stack dumps to asserts
  hrtimer: fix rq->lock inversion (again)
2009-04-06 13:37:30 -07:00
Linus Torvalds
12fe32e4f9 Merge branch 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  kmemtrace: trace kfree() calls with NULL or zero-length objects
  kmemtrace: small cleanups
  kmemtrace: restore original tracing data binary format, improve ABI
  kmemtrace: kmemtrace_alloc() must fill type_id
  kmemtrace: use tracepoints
  kmemtrace, rcu: don't include unnecessary headers, allow kmemtrace w/ tracepoints
  kmemtrace, rcu: fix rcupreempt.c data structure dependencies
  kmemtrace, rcu: fix rcu_tree_trace.c data structure dependencies
  kmemtrace, rcu: fix linux/rcutree.h and linux/rcuclassic.h dependencies
  kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c
  kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_unlzma.c
  kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_bunzip2.c
  kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_inflate.c
  kmemtrace, squashfs: fix slab.h dependency problem in squasfs
  kmemtrace, befs: fix slab.h dependency problem
  kmemtrace, security: fix linux/key.h header file dependencies
  kmemtrace, fs: fix linux/fdtable.h header file dependencies
  kmemtrace, fs: uninline simple_transaction_set()
  kmemtrace, fs, security: move alloc_secdata() and free_secdata() to linux/security.h
2009-04-06 13:30:00 -07:00
Darren Hart
52400ba946 futex: add requeue_pi functionality
PI Futexes and their underlying rt_mutex cannot be left ownerless if
there are pending waiters as this will break the PI boosting logic, so
the standard requeue commands aren't sufficient.  The new commands
properly manage pi futex ownership by ensuring a futex with waiters
has an owner at all times.  This will allow glibc to properly handle
pi mutexes with pthread_condvars.

The approach taken here is to create two new futex op codes:

FUTEX_WAIT_REQUEUE_PI:
Tasks will use this op code to wait on a futex (such as a non-pi waitqueue)
and wake after they have been requeued to a pi futex.  Prior to returning to
userspace, they will acquire this pi futex (and the underlying rt_mutex).

futex_wait_requeue_pi() is the result of a high speed collision between
futex_wait() and futex_lock_pi() (with the first part of futex_lock_pi() being
done by futex_proxy_trylock_atomic() on behalf of the top_waiter).

FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI):
This call must be used to wake tasks waiting with FUTEX_WAIT_REQUEUE_PI,
regardless of how many tasks the caller intends to wake or requeue.
pthread_cond_broadcast() should call this with nr_wake=1 and
nr_requeue=INT_MAX.  pthread_cond_signal() should call this with nr_wake=1 and
nr_requeue=0.  The reason being we need both callers to get the benefit of the
futex_proxy_trylock_atomic() routine.  futex_requeue() also enqueues the
top_waiter on the rt_mutex via rt_mutex_start_proxy_lock().

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-06 11:14:03 +02:00
Darren Hart
f801073f87 futex: split out futex value validation code
Refactor the code to validate the expected futex value in order to
reuse it with the requeue_pi code.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-06 11:14:02 +02:00
Darren Hart
9121e4783c futex: distangle futex_requeue()
futex_requeue() is getting a bit long-winded, and will be getting more
so after the requeue_pi patch. Factor out the actual requeueing into a
nicely contained inline function to reduce function length and improve
legibility.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-06 11:14:02 +02:00
Darren Hart
a72188d8a6 futex: add FUTEX_HAS_TIMEOUT flag to restart.futex.flags
Currently restart is only used if there is a timeout. The requeue_pi
functionality requires restarting to futex_lock_pi() on signal after
wakeup in futex_wait_requeue_pi() regardless of if there was a timeout
or not. Using 0 for the timeout value is confusing as that could
indicate an expired timer. The flag makes this explicit. While the
check is not technically needed in futex_wait_restart(), doing so
makes the code consistent with and will avoid confusion should the
need arise to restart wait without a timeout.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-06 11:14:02 +02:00
Darren Hart
8dac456a68 rt_mutex: add proxy lock routines
This patch is a prerequisite for futex requeue_pi. It basically splits
rt_mutex_slowlock() right down the middle, just before the first call
to schedule(). It further adds helper functions which make use of the
split and provide the rt-mutex preliminaries for futex requeue_pi.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-06 11:14:02 +02:00
Darren Hart
dd9739980b futex: split out fixup owner logic from futex_lock_pi()
Refactor the post lock acquisition logic from futex_lock_pi(). This
code will be reused in futex_wait_requeue_pi().

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-06 11:14:02 +02:00
Darren Hart
1a52084d09 futex: split out atomic logic from futex_lock_pi()
Refactor the atomic portion of futex_lock_pi() into futex_lock_pi_atomic().

This logic will be needed by requeue_pi, so modularize it to reduce
code duplication.  The only significant change is passing of the task
to try and take the lock for.  This simplifies the -EDEADLK test as if
the lock is owned by task t, it's a deadlock, regardless of if we are
doing requeue pi or not.  This patch updates the corresponding comment
accordingly.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-06 11:14:01 +02:00
Darren Hart
4b1c486b35 futex: add helper to find the top prio waiter of a futex
Improve legibility by wrapping finding the top waiter in a function.
This will be used by the follow-on patches for enabling requeue pi.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-06 11:14:01 +02:00
Darren Hart
ca5f9524d6 futex: separate futex_wait_queue_me() logic from futex_wait()
Refactor futex_wait() in preparation for futex_wait_requeue_pi(). In
order to reuse a good chunk of the futex_wait() code for the upcoming
futex_wait_requeue_pi() function, this patch breaks out the
queue-to-wakeup section of futex_wait() into futex_wait_queue_me().

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-06 11:14:01 +02:00
Peter Zijlstra
92f22a3865 perf_counter: update mmap() counter read
Paul noted that we don't need SMP barriers for the mmap() counter read
because its always on the same cpu (otherwise you can't access the hw
counter anyway).

So remove the SMP barriers and replace them with regular compiler
barriers.

Further, update the comment to include a race free method of reading
said hardware counter. The primary change is putting the pmc_read
inside the seq-loop, otherwise we can still race and read rubbish.

Noticed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Orig-LKML-Reference: <20090402091319.577951445@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:47 +02:00
Peter Zijlstra
5872bdb88a perf_counter: add more context information
Put in counts to tell which ips belong to what context.

  -----
   | |  hv
   | --
nr | |  kernel
   | --
   | |  user
  -----

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Orig-LKML-Reference: <20090402091319.493101305@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:46 +02:00
Peter Zijlstra
c457810ab4 perf_counter: per event wakeups
By request, provide a way to request a wakeup every 'n' events instead
of every page of output.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Orig-LKML-Reference: <20090402091319.323309784@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:45 +02:00
Peter Zijlstra
8a057d8491 perf_counter: move the event overflow output bits to record_type
Per suggestion from Paul, move the event overflow bits to record_type
and sanitize the enums a bit.

Breaks the ABI -- again ;-)

Suggested-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Orig-LKML-Reference: <20090402091319.151921176@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:45 +02:00
Peter Zijlstra
394ee07623 perf_counter: provide generic callchain bits
Provide the generic callchain support bits. If hw_event->callchain is
set the arch specific perf_callchain() function is called upon to
provide a perf_callchain_entry structure filled with the current
callchain.

If it does so, it is added to the overflow output event.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <20090330171024.254266860@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:43 +02:00
Peter Zijlstra
5ed00415e3 perf_counter: re-arrange the perf_event_type
Breaks ABI yet again :-)

Change the event type so that [0, 2^31-1] are regular event types, but
[2^31, 2^32-1] forms a bitmask for overflow events.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <20090330171024.047961770@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:42 +02:00
Peter Zijlstra
78d613eb12 perf_counter: small cleanup of the output routines
Move the nmi argument to the _begin() function, so that _end() only needs the
handle. This allows the _begin() function to generate a wakeup on event loss.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <20090330171023.959404268@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:41 +02:00
Paul Mackerras
d5d2bc0dd0 perf_counter: make it possible for hw_perf_counter_init to return error codes
Impact: better error reporting

At present, if hw_perf_counter_init encounters an error, all it can do
is return NULL, which causes sys_perf_counter_open to return an EINVAL
error to userspace.  This isn't very informative for userspace; it means
that userspace can't tell the difference between "sorry, oprofile is
already using the PMU" and "we don't support this CPU" and "this CPU
doesn't support the requested generic hardware event".

This commit uses the PTR_ERR/ERR_PTR/IS_ERR set of macros to let
hw_perf_counter_init return an error code on error rather than just NULL
if it wishes.  If it does so, that error code will be returned from
sys_perf_counter_open to userspace.  If it returns NULL, an EINVAL
error will be returned to userspace, as before.

This also adapts the powerpc hw_perf_counter_init to make use of this
to return ENXIO, EINVAL, EBUSY, or EOPNOTSUPP as appropriate.  It would
be good to add extra error numbers in future to allow userspace to
distinguish the various errors that are currently reported as EINVAL,
i.e. irq_period < 0, too many events in a group, conflict between
exclude_* settings in a group, and PMU resource conflict in a group.

[ v2: fix a bug pointed out by Corey Ashford where error returns from
      hw_perf_counter_init were not handled correctly in the case of
      raw hardware events.]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Orig-LKML-Reference: <20090330171023.682428180@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:40 +02:00
Peter Zijlstra
0a4a93919b perf_counter: executable mmap() information
Currently the profiling information returns userspace IPs but no way
to correlate them to userspace code. Userspace could look into
/proc/$pid/maps but that might not be current or even present anymore
at the time of analyzing the IPs.

Therefore provide means to track the mmap information and provide it
in the output stream.

XXX: only covers mmap()/munmap(), mremap() and mprotect() are missing.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Orig-LKML-Reference: <20090330171023.417259499@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:38 +02:00
Peter Zijlstra
38ff667b32 perf_counter: fix update_userpage()
It just occured to me it is possible to have multiple contending
updates of the userpage (mmap information vs overflow vs counter).
This would break the seqlock logic.

It appear the arch code uses this from NMI context, so we cannot
possibly serialize its use, therefore separate the data_head update
from it and let it return to its original use.

The arch code needs to make sure there are no contending callers by
disabling the counter before using it -- powerpc appears to do this
nicely.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <20090330171023.241410660@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:37 +02:00
Peter Zijlstra
925d519ab8 perf_counter: unify and fix delayed counter wakeup
While going over the wakeup code I noticed delayed wakeups only work
for hardware counters but basically all software counters rely on
them.

This patch unifies and generalizes the delayed wakeup to fix this
issue.

Since we're dealing with NMI context bits here, use a cmpxchg() based
single link list implementation to track counters that have pending
wakeups.

[ This should really be generic code for delayed wakeups, but since we
  cannot use cmpxchg()/xchg() in generic code, I've let it live in the
  perf_counter code. -- Eric Dumazet could use it to aggregate the
  network wakeups. ]

Furthermore, the x86 method of using TIF flags was flawed in that its
quite possible to end up setting the bit on the idle task, loosing the
wakeup.

The powerpc method uses per-cpu storage and does appear to be
sufficient.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <20090330171023.153932974@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:36 +02:00
Paul Mackerras
53cfbf5937 perf_counter: record time running and time enabled for each counter
Impact: new functionality

Currently, if there are more counters enabled than can fit on the CPU,
the kernel will multiplex the counters on to the hardware using
round-robin scheduling.  That isn't too bad for sampling counters, but
for counting counters it means that the value read from a counter
represents some unknown fraction of the true count of events that
occurred while the counter was enabled.

This remedies the situation by keeping track of how long each counter
is enabled for, and how long it is actually on the cpu and counting
events.  These times are recorded in nanoseconds using the task clock
for per-task counters and the cpu clock for per-cpu counters.

These values can be supplied to userspace on a read from the counter.
Userspace requests that they be supplied after the counter value by
setting the PERF_FORMAT_TOTAL_TIME_ENABLED and/or
PERF_FORMAT_TOTAL_TIME_RUNNING bits in the hw_event.read_format field
when creating the counter.  (There is no way to change the read format
after the counter is created, though it would be possible to add some
way to do that.)

Using this information it is possible for userspace to scale the count
it reads from the counter to get an estimate of the true count:

true_count_estimate = count * total_time_enabled / total_time_running

This also lets userspace detect the situation where the counter never
got to go on the cpu: total_time_running == 0.

This functionality has been requested by the PAPI developers, and will
be generally needed for interpreting the count values from counting
counters correctly.

In the implementation, this keeps 5 time values (in nanoseconds) for
each counter: total_time_enabled and total_time_running are used when
the counter is in state OFF or ERROR and for reporting back to
userspace.  When the counter is in state INACTIVE or ACTIVE, it is the
tstamp_enabled, tstamp_running and tstamp_stopped values that are
relevant, and total_time_enabled and total_time_running are determined
from them.  (tstamp_stopped is only used in INACTIVE state.)  The
reason for doing it like this is that it means that only counters
being enabled or disabled at sched-in and sched-out time need to be
updated.  There are no new loops that iterate over all counters to
update total_time_enabled or total_time_running.

This also keeps separate child_total_time_running and
child_total_time_enabled fields that get added in when reporting the
totals to userspace.  They are separate fields so that they can be
atomic.  We don't want to use atomics for total_time_running,
total_time_enabled etc., because then we would have to use atomic
sequences to update them, which are slower than regular arithmetic and
memory accesses.

It is possible to measure total_time_running by adding a task_clock
counter to each group of counters, and total_time_enabled can be
measured approximately with a top-level task_clock counter (though
inaccuracies will creep in if you need to disable and enable groups
since it is not possible in general to disable/enable the top-level
task_clock counter simultaneously with another group).  However, that
adds extra overhead - I measured around 15% increase in the context
switch latency reported by lat_ctx (from lmbench) when a task_clock
counter was added to each of 2 groups, and around 25% increase when a
task_clock counter was added to each of 4 groups.  (In both cases a
top-level task-clock counter was also added.)

In contrast, the code added in this commit gives better information
with no overhead that I could measure (in fact in some cases I
measured lower times with this code, but the differences were all less
than one standard deviation).

[ v2: address review comments by Andrew Morton. ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Orig-LKML-Reference: <18890.6578.728637.139402@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:36 +02:00
Peter Zijlstra
7730d86558 perf_counter: allow and require one-page mmap on counting counters
A brainfart stopped single page mmap()s working. The rest of the code
should be perfectly fine with not having any data pages.

Reported-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Orig-LKML-Reference: <1237981712.7972.812.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:35 +02:00
Peter Zijlstra
ea5d20cf99 perf_counter: optionally provide the pid/tid of the sampled task
Allow cpu wide counters to profile userspace by providing what process
the sample belongs to.

This raises the first issue with the output type, lots of these
options: group, tid, callchain, etc.. are non-exclusive and could be
combined, suggesting a bitfield.

However, things like the mmap() data stream doesn't fit in that.

How to split the type field...

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Orig-LKML-Reference: <20090325113317.013775235@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:34 +02:00
Peter Zijlstra
63e35b25d6 perf_counter: sanity check on the output API
Ensure we never write more than we said we would.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Orig-LKML-Reference: <20090325113316.921433024@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:33 +02:00
Peter Zijlstra
5c14819432 perf_counter: output objects
Provide a {type,size} header for each output entry.

This should provide extensible output, and the ability to mix multiple streams.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Orig-LKML-Reference: <20090325113316.831607932@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:33 +02:00
Peter Zijlstra
b9cacc7bf1 perf_counter: more elaborate write API
Provide a begin, copy, end interface to the output buffer.

begin() reserves the space,
 copy() copies the data over, considering page boundaries,
  end() finalizes the event and does the wakeup.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Orig-LKML-Reference: <20090325113316.740550870@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:32 +02:00
Peter Zijlstra
c7138f37f9 perf_counter: fix perf_poll()
Impact: fix kerneltop 100% CPU usage

Only return a poll event when there's actually been one, poll_wait()
doesn't actually wait for the waitq you pass it, it only enqueues
you on it.

Only once all FDs have been iterated and none of thm returned a
poll-event will it schedule().

Also make it return POLL_HUP when there's not mmap() area to read from.

Further, fix a silly bug in the write code.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Orig-LKML-Reference: <1237897096.24918.181.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:32 +02:00
Peter Zijlstra
7b732a7504 perf_counter: new output ABI - part 1
Impact: Rework the perfcounter output ABI

use sys_read() only for instant data and provide mmap() output for all
async overflow data.

The first mmap() determines the size of the output buffer. The mmap()
size must be a PAGE_SIZE multiple of 1+pages, where pages must be a
power of 2 or 0. Further mmap()s of the same fd must have the same
size. Once all maps are gone, you can again mmap() with a new size.

In case of 0 extra pages there is no data output and the first page
only contains meta data.

When there are data pages, a poll() event will be generated for each
full page of data. Furthermore, the output is circular. This means
that although 1 page is a valid configuration, its useless, since
we'll start overwriting it the instant we report a full page.

Future work will focus on the output format (currently maintained)
where we'll likey want each entry denoted by a header which includes a
type and length.

Further future work will allow to splice() the fd, also containing the
async overflow data -- splice() would be mutually exclusive with
mmap() of the data.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <20090323172417.470536358@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:27 +02:00
H. Peter Anvin
b09d2501ed mutex: drop "inline" from mutex_lock() inside kernel/mutex.c
Impact: build fix

mutex_lock() is was defined inline in kernel/mutex.c, but wasn't
declared so not in <linux/mutex.h>.  This didn't cause a problem until
checkin 3a2d367d9aabac486ac4444c6c7ec7a1dab16267 added the
atomic_dec_and_mutex_lock() inline in between declaration and
definion.

This broke building with CONFIG_ALLOW_WARNINGS=n, e.g. make
allnoconfig.

Either from the source code nor the allnoconfig binary output I cannot
find any internal references to mutex_lock() in kernel/mutex.c, so
presumably this "inline" is now-useless legacy.

Cc: Eric Paris <eparis@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <tip-3a2d367d9aabac486ac4444c6c7ec7a1dab16267@git.kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-04-06 09:30:27 +02:00
Paul Mackerras
37d8182838 perf_counter: add an mmap method to allow userspace to read hardware counters
Impact: new feature giving performance improvement

This adds the ability for userspace to do an mmap on a hardware counter
fd and get access to a read-only page that contains the information
needed to translate a hardware counter value to the full 64-bit
counter value that would be returned by a read on the fd.  This is
useful on architectures that allow user programs to read the hardware
counters, such as PowerPC.

The mmap will only succeed if the counter is a hardware counter
monitoring the current process.

On my quad 2.5GHz PowerPC 970MP machine, userspace can read a counter
and translate it to the full 64-bit value in about 30ns using the
mmapped page, compared to about 830ns for the read syscall on the
counter, so this does give a significant performance improvement.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Orig-LKML-Reference: <20090323172417.297057964@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:26 +02:00
Peter Zijlstra
96f6d44443 perf_counter: avoid recursion
Tracepoint events like lock_acquire and software counters like
pagefaults can recurse into the perf counter code again, avoid that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <20090323172417.152096433@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:25 +02:00
Peter Zijlstra
f4a2deb486 perf_counter: remove the event config bitfields
Since the bitfields turned into a bit of a mess, remove them and rely on
good old masks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Orig-LKML-Reference: <20090323172417.059499915@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:25 +02:00
Peter Zijlstra
0322cd6ec5 perf_counter: unify irq output code
Impact: cleanup

Having 3 slightly different copies of the same code around does nobody
any good. First step in revamping the output format.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Orig-LKML-Reference: <20090319194233.929962222@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:17 +02:00
Peter Zijlstra
b8e83514b6 perf_counter: revamp syscall input ABI
Impact: modify ABI

The hardware/software classification in hw_event->type became a little
strained due to the addition of tracepoint tracing.

Instead split up the field and provide a type field to explicitly specify
the counter type, while using the event_id field to specify which event to
use.

Raw counters still work as before, only the raw config now goes into
raw_event.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Orig-LKML-Reference: <20090319194233.836807573@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:17 +02:00
Peter Zijlstra
e077df4f43 perf_counter: hook up the tracepoint events
Impact: new perfcounters feature

Enable usage of tracepoints as perf counter events.

tracepoint event ids can be found in /debug/tracing/event/*/*/id
and (for now) are represented as -65536+id in the type field.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Orig-LKML-Reference: <20090319194233.744044174@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:16 +02:00
Peter Zijlstra
f160095275 perf_counter: fix up counter free paths
Impact: fix crash during perfcounters use

I found another counter free path, create a free_counter() call to
accomodate generic tear-down.

Fixes an RCU bug.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Orig-LKML-Reference: <20090319194233.652078652@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:16 +02:00
Peter Zijlstra
4a0deca657 perf_counter: generic context switch event
Impact: cleanup

Use the generic software events for context switches.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Orig-LKML-Reference: <20090319194233.283522645@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:15 +02:00
Peter Zijlstra
01ef09d9ff perf_counter: fix uninitialized usage of event_list
Impact: fix boot crash

When doing the generic context switch event I ran into some early
boot hangs, which were caused by inf func recursion (event, fault,
event, fault).

I eventually tracked it down to event_list not being initialized
at the time of the first event. Fix this.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Orig-LKML-Reference: <20090319194233.195392657@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:15 +02:00
Paul Mackerras
b6c5a71da1 perf_counter: abstract wakeup flag setting in core to fix powerpc build
Impact: build fix for powerpc

Commit bd753921015e7905 ("perf_counter: software counter event
infrastructure") introduced a use of TIF_PERF_COUNTERS into the core
perfcounter code.  This breaks the build on powerpc because we use
a flag in a per-cpu area to signal wakeups on powerpc rather than
a thread_info flag, because the thread_info flags have to be
manipulated with atomic operations and are thus slower than per-cpu
flags.

This fixes the by changing the core to use an abstracted
set_perf_counter_pending() function, which is defined on x86 to set
the TIF_PERF_COUNTERS flag and on powerpc to set the per-cpu flag
(paca->perf_counter_pending).  It changes the previous powerpc
definition of set_perf_counter_pending to not take an argument and
adds a clear_perf_counter_pending, so as to simplify the definition
on x86.

On x86, set_perf_counter_pending() is defined as a macro.  Defining
it as a static inline in arch/x86/include/asm/perf_counters.h causes
compile failures because <asm/perf_counters.h> gets included early in
<linux/sched.h>, and the definitions of set_tsk_thread_flag etc. are
therefore not available in <asm/perf_counters.h>.  (On powerpc this
problem is avoided by defining set_perf_counter_pending etc. in
<asm/hw_irq.h>.)

Signed-off-by: Paul Mackerras <paulus@samba.org>
2009-04-06 09:30:14 +02:00
Tim Blechmann
4e193bd4df perf_counter: include missing header
Impact: build fix

In order to compile a kernel with performance counter patches,
<asm/irq_regs.h> has to be included to provide the declaration of
struct pt_regs *get_irq_regs(void);

[ This bug was masked by unrelated x86 header file changes in the
  x86 tree, but occurs in the tip:perfcounters/core standalone
  tree. ]

Signed-off-by: Tim Blechmann <tim@klingt.org>
Orig-LKML-Reference: <20090314142925.49c29c17@thinkpad>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:30:13 +02:00
Peter Zijlstra
039fc91e06 perf_counter: fix hrtimer sampling
Impact: fix deadlock with perfstat

Fix for the perfstat fubar..

We cannot unconditionally call hrtimer_cancel() without ever having done
hrtimer_init() on the thing.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Orig-LKML-Reference: <1236959027.22447.149.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:29:46 +02:00
Peter Zijlstra
592903cdcb perf_counter: add an event_list
I noticed that the counter_list only includes top-level counters, thus
perf_swcounter_event() will miss sw-counters in groups.

Since perf_swcounter_event() also wants an RCU safe list, create a new
event_list that includes all counters and uses RCU list ops and use call_rcu
to free the counter structure.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:29:43 +02:00
Peter Zijlstra
d6d020e995 perf_counter: hrtimer based sampling for software time events
Use hrtimers to profile timer based sampling for the software time
counters.

This allows platforms without hardware counter support to still
perform sample based profiling.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:29:41 +02:00
Peter Zijlstra
ac17dc8e58 perf_counter: provide major/minor page fault software events
Provide separate sw counters for major and minor page faults.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:29:40 +02:00
Peter Zijlstra
7dd1fcc258 perf_counter: provide pagefault software events
We use the generic software counter infrastructure to provide
page fault events.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:29:37 +02:00
Peter Zijlstra
15dbf27cc1 perf_counter: software counter event infrastructure
Provide generic software counter infrastructure that supports
software events.

This will be used to allow sample based profiling based on software
events such as pagefaults. The current infrastructure can only
provide a count of such events, no place information.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:29:36 +02:00
Peter Zijlstra
755642322a perf_counter: use list_move_tail()
Instead of del/add use a move list-op.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:29:31 +02:00
Ingo Molnar
f541ae326f Merge branch 'linus' into perfcounters/core-v2
Merge reason: we have gathered quite a few conflicts, need to merge upstream

Conflicts:
	arch/powerpc/kernel/Makefile
	arch/x86/ia32/ia32entry.S
	arch/x86/include/asm/hardirq.h
	arch/x86/include/asm/unistd_32.h
	arch/x86/include/asm/unistd_64.h
	arch/x86/kernel/cpu/common.c
	arch/x86/kernel/irq.c
	arch/x86/kernel/syscall_table_32.S
	arch/x86/mm/iomap_32.c
	include/linux/sched.h
	kernel/Makefile

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-06 09:02:57 +02:00
Ingo Molnar
9efe21cb82 Merge branch 'linus' into irq/threaded
Conflicts:
	include/linux/irq.h
	kernel/irq/handle.c
2009-04-06 01:41:22 +02:00
Linus Torvalds
0221c81b1b Merge branch 'audit.b62' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b62' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  Audit: remove spaces from audit_log_d_path
  audit: audit_set_auditable defined but not used
  audit: incorrect ref counting in audit tree tag_chunk
  audit: Fix possible return value truncation in audit_get_context()
  audit: ignore terminating NUL in AUDIT_USER_TTY messages
  Audit: fix handling of 'strings' with NULL characters
  make the e->rule.xxx shorter in kernel auditfilter.c
  auditsc: fix kernel-doc notation
  audit: EXECVE record - removed bogus newline
2009-04-05 12:36:11 -07:00
Linus Torvalds
714f83d5d9 Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (413 commits)
  tracing, net: fix net tree and tracing tree merge interaction
  tracing, powerpc: fix powerpc tree and tracing tree interaction
  ring-buffer: do not remove reader page from list on ring buffer free
  function-graph: allow unregistering twice
  trace: make argument 'mem' of trace_seq_putmem() const
  tracing: add missing 'extern' keywords to trace_output.h
  tracing: provide trace_seq_reserve()
  blktrace: print out BLK_TN_MESSAGE properly
  blktrace: extract duplidate code
  blktrace: fix memory leak when freeing struct blk_io_trace
  blktrace: fix blk_probes_ref chaos
  blktrace: make classic output more classic
  blktrace: fix off-by-one bug
  blktrace: fix the original blktrace
  blktrace: fix a race when creating blk_tree_root in debugfs
  blktrace: fix timestamp in binary output
  tracing, Text Edit Lock: cleanup
  tracing: filter fix for TRACE_EVENT_FORMAT events
  ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release()
  x86: kretprobe-booster interrupt emulation code fix
  ...

Fix up trivial conflicts in
 arch/parisc/include/asm/ftrace.h
 include/linux/memory.h
 kernel/extable.c
 kernel/module.c
2009-04-05 11:04:19 -07:00
Eric Paris
def5754341 Audit: remove spaces from audit_log_d_path
audit_log_d_path had spaces in the strings which would be emitted on the
error paths.  This patch simply replaces those spaces with an _ or removes
the needless spaces entirely.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-05 13:49:04 -04:00
Eric Paris
679173b724 audit: audit_set_auditable defined but not used
after 0590b9335a audit_set_auditable() is now only
used by the audit tree code.  If CONFIG_AUDIT_TREE is unset it will be defined
but unused.  This patch simply moves the function inside a CONFIG_AUDIT_TREE
block.

cc1: warnings being treated as errors
/home/acme_unencrypted/git/linux-2.6-tip/kernel/auditsc.c:745: error: ‘audit_set_auditable’ defined but not used
make[2]: *** [kernel/auditsc.o] Error 1
make[1]: *** [kernel] Error 2
make[1]: *** Waiting for unfinished jobs....

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-05 13:48:52 -04:00
Eric Paris
318b6d3d7d audit: incorrect ref counting in audit tree tag_chunk
tag_chunk has bad exit paths in which the inotify ref counting is wrong.
At the top of the function we found &old_watch using  inotify_find_watch().
inotify_find_watch takes a reference to the watch.  This is never dropped
on an error path.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-05 13:48:26 -04:00
Paul Moore
6d208da89a audit: Fix possible return value truncation in audit_get_context()
The audit subsystem treats syscall return codes as type long, unfortunately
the audit_get_context() function mistakenly converts the return code to an
int type in the parameters which could cause problems on systems where the
sizeof(int) != sizeof(long).

Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-05 13:46:19 -04:00
Miloslav Trmac
55ad2f8d34 audit: ignore terminating NUL in AUDIT_USER_TTY messages
AUDIT_USER_TTY, like all other messages sent from user-space, is sent
NUL-terminated.  Unlike other user-space audit messages, which come only
from trusted sources, AUDIT_USER_TTY messages are processed using
audit_log_n_untrustedstring().

This patch modifies AUDIT_USER_TTY handling to ignore the trailing NUL
and use the "quoted_string" representation of the message if possible.

Signed-off-by: Miloslav Trmac <mitr@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-05 13:43:36 -04:00
Miloslav Trmac
b3897f5671 Audit: fix handling of 'strings' with NULL characters
currently audit_log_n_untrustedstring() uses audit_string_contains_control()
to check if the 'string' has any control characters.  If the 'string' has an
embedded NULL audit_string_contains_control() will return that the data has
no control characters and will then pass the string to audit_log_n_string
with the total length, not the length up to the first NULL.
audit_log_n_string() does a memcpy of the entire length and so the actual
audit record emitted may then contain a NULL and then whatever random memory
is after the NULL.

Since we want to log the entire octet stream (if we can't trust the data
to be a string we can't trust that a NULL isn't actually a part of it)
we should just consider NULL as a control character.  If the caller is
certain they want to stop at the first NULL they should be using
audit_log_untrustedstring.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-05 13:43:24 -04:00
Zhenwen Xu
c28bb7da74 make the e->rule.xxx shorter in kernel auditfilter.c
make the e->rule.xxx shorter in kernel/auditfilter.c
--
---------------------------------
Zhenwen Xu - Open and Free
Home Page:	http://zhwen.org
My Studio:	http://dim4.cn

>From 99692dc640b278f1cb1a15646ce42f22e89c0f77 Mon Sep 17 00:00:00 2001
From: Zhenwen Xu <Helight.Xu@gmail.com>
Date: Thu, 12 Mar 2009 22:04:59 +0800
Subject: [PATCH] make the e->rule.xxx shorter in kernel/auditfilter.c

Signed-off-by: Zhenwen Xu <Helight.Xu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-05 13:40:33 -04:00
Randy Dunlap
6b96255998 auditsc: fix kernel-doc notation
Fix auditsc kernel-doc notation:

Warning(linux-2.6.28-git7//kernel/auditsc.c:2156): No description found for parameter 'attr'
Warning(linux-2.6.28-git7//kernel/auditsc.c:2156): Excess function parameter 'u_attr' description in '__audit_mq_open'
Warning(linux-2.6.28-git7//kernel/auditsc.c:2204): No description found for parameter 'notification'
Warning(linux-2.6.28-git7//kernel/auditsc.c:2204): Excess function parameter 'u_notification' description in '__audit_mq_notify'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
cc:	Al Viro <viro@zeniv.linux.org.uk>
cc:	Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-05 13:39:19 -04:00
Jiri Pirko
ca96a895a6 audit: EXECVE record - removed bogus newline
(updated)
Added hunk that changes the comment, the rest is the same.

EXECVE records contain a newline after every argument. auditd converts
"\n" to " " so you cannot see newlines even in raw logs, but they're
there nevertheless. If you're not using auditd, you need to work round
them. These '\n' chars are can be easily replaced by spaces when
creating record in kernel. Note there is no need for trailing '\n' in
an audit record.

record before this patch:
"type=EXECVE msg=audit(1231421801.566:31): argc=4 a0=\"./test\"\na1=\"a\"\na2=\"b\"\na3=\"c\"\n"

record after this patch:
"type=EXECVE msg=audit(1231421801.566:31): argc=4 a0=\"./test\" a1=\"a\" a2=\"b\" a3=\"c\""

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-05 13:38:59 -04:00
Linus Torvalds
90975ef712 Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: (36 commits)
  cpumask: remove cpumask allocation from idle_balance, fix
  numa, cpumask: move numa_node_id default implementation to topology.h, fix
  cpumask: remove cpumask allocation from idle_balance
  x86: cpumask: x86 mmio-mod.c use cpumask_var_t for downed_cpus
  x86: cpumask: update 32-bit APM not to mug current->cpus_allowed
  x86: microcode: cleanup
  x86: cpumask: use work_on_cpu in arch/x86/kernel/microcode_core.c
  cpumask: fix CONFIG_CPUMASK_OFFSTACK=y cpu hotunplug crash
  numa, cpumask: move numa_node_id default implementation to topology.h
  cpumask: convert node_to_cpumask_map[] to cpumask_var_t
  cpumask: remove x86 cpumask_t uses.
  cpumask: use cpumask_var_t in uv_flush_tlb_others.
  cpumask: remove cpumask_t assignment from vector_allocation_domain()
  cpumask: make Xen use the new operators.
  cpumask: clean up summit's send_IPI functions
  cpumask: use new cpumask functions throughout x86
  x86: unify cpu_callin_mask/cpu_callout_mask/cpu_initialized_mask/cpu_sibling_setup_mask
  cpumask: convert struct cpuinfo_x86's llc_shared_map to cpumask_var_t
  cpumask: convert node_to_cpumask_map[] to cpumask_var_t
  x86: unify 32 and 64-bit node_to_cpumask_map
  ...
2009-04-05 10:33:07 -07:00
Linus Torvalds
cab4e4c43f Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-module-and-param
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-module-and-param:
  module: use strstarts()
  strstarts: helper function for !strncmp(str, prefix, strlen(prefix))
  arm: allow usage of string functions in linux/string.h
  module: don't use stop_machine on module load
  module: create a request_module_nowait()
  module: include other structures in module version check
  module: remove the SHF_ALLOC flag on the __versions section.
  module: clarify the force-loading taint message.
  module: Export symbols needed for Ksplice
  Ksplice: Add functions for walking kallsyms symbols
  module: remove module_text_address()
  module: __module_address
  module: Make find_symbol return a struct kernel_symbol
  kernel/module.c: fix an unused goto label
  param: fix charp parameters set via sysfs

Fix trivial conflicts in kernel/extable.c manually.
2009-04-05 10:30:21 -07:00
Linus Torvalds
e4c393fd55 Merge branch 'printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  printk: correct the behavior of printk_timed_ratelimit()
  vsprintf: unify the format decoding layer for its 3 users, cleanup
  fix regression from "vsprintf: unify the format decoding layer for its 3 users"
  vsprintf: fix bug in negative value printing
  vsprintf: unify the format decoding layer for its 3 users
  vsprintf: add binary printf
  printk: introduce printk_once()

Fix trivial conflicts (printk_once vs log_buf_kexec_setup() added near
each other) in include/linux/kernel.h.
2009-04-05 10:23:25 -07:00
Linus Torvalds
09f38dc19d Merge branch 'core-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  ptrace: remove a useless goto
2009-04-03 17:35:06 -07:00
Linus Torvalds
30a39e0e97 Merge branch 'stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  symbols, stacktrace: look up init symbols after module symbols
2009-04-03 17:34:41 -07:00
Linus Torvalds
c7edad5fcb Merge branch 'rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  rcu: rcu_barrier VS cpu_hotplug: Ensure callbacks in dead cpu are migrated to online cpu
2009-04-03 17:34:12 -07:00
Linus Torvalds
b1dbb67911 Merge branch 'ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  s390: remove arch specific smp_send_stop()
  panic: clean up kernel/panic.c
  panic, smp: provide smp_send_stop() wrapper on UP too
  panic: decrease oops_in_progress only after having done the panic
  generic-ipi: eliminate WARN_ON()s during oops/panic
  generic-ipi: cleanups
  generic-ipi: remove CSD_FLAG_WAIT
  generic-ipi: remove kmalloc()
  generic IPI: simplify barriers and locking
2009-04-03 17:33:30 -07:00
Linus Torvalds
492f59f526 Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  locking: rename trace_softirq_[enter|exit] => lockdep_softirq_[enter|exit]
  lockdep: remove duplicate CONFIG_DEBUG_LOCKDEP definitions
  lockdep: require framepointers for x86
  lockdep: remove extra "irq" string
  lockdep: fix incorrect state name
2009-04-03 17:29:53 -07:00
Linus Torvalds
811158b147 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
  trivial: Update my email address
  trivial: NULL noise: drivers/mtd/tests/mtd_*test.c
  trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h
  trivial: Fix misspelling of "Celsius".
  trivial: remove unused variable 'path' in alloc_file()
  trivial: fix a pdlfush -> pdflush typo in comment
  trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL
  trivial: wusb: Storage class should be before const qualifier
  trivial: drivers/char/bsr.c: Storage class should be before const qualifier
  trivial: h8300: Storage class should be before const qualifier
  trivial: fix where cgroup documentation is not correctly referred to
  trivial: Give the right path in Documentation example
  trivial: MTD: remove EOL from MODULE_DESCRIPTION
  trivial: Fix typo in bio_split()'s documentation
  trivial: PWM: fix of #endif comment
  trivial: fix typos/grammar errors in Kconfig texts
  trivial: Fix misspelling of firmware
  trivial: cgroups: documentation typo and spelling corrections
  trivial: Update contact info for Jochen Hein
  trivial: fix typo "resgister" -> "register"
  ...
2009-04-03 15:24:35 -07:00
Yinghai Lu
9756b15e1b irq: fix cpumask memory leak on offstack cpumask kernels
Need to free the old cpumask for affinity and pending_mask.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <49D18FF0.50707@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03 19:14:44 +02:00
David Howells
8f0aa2f25b Document the slow work thread pool
Document the slow work thread pool.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03 16:42:35 +01:00
David Howells
12e22c5e4b Make the slow work pool configurable
Make the slow work pool configurable through /proc/sys/kernel/slow-work.

 (*) /proc/sys/kernel/slow-work/min-threads

     The minimum number of threads that should be in the pool as long as it is
     in use.  This may be anywhere between 2 and max-threads.

 (*) /proc/sys/kernel/slow-work/max-threads

     The maximum number of threads that should in the pool.  This may be
     anywhere between min-threads and 255 or NR_CPUS * 2, whichever is greater.

 (*) /proc/sys/kernel/slow-work/vslow-percentage

     The percentage of active threads in the pool that may be used to execute
     very slow work items.  This may be between 1 and 99.  The resultant number
     is bounded to between 1 and one fewer than the number of active threads.
     This ensures there is always at least one thread that can process very
     slow work items, and always at least one thread that won't.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03 16:42:35 +01:00
David Howells
109d9272c4 Make slow-work thread pool actually dynamic
Make the slow-work thread pool actually dynamic in the number of threads it
contains.  With this patch, it will both create additional threads when it has
extra work to do, and cull excess threads that aren't doing anything.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03 16:42:35 +01:00
David Howells
07fe7cb7c7 Create a dynamically sized pool of threads for doing very slow work items
Create a dynamically sized pool of threads for doing very slow work items, such
as invoking mkdir() or rmdir() - things that may take a long time and may
sleep, holding mutexes/semaphores and hogging a thread, and are thus unsuitable
for workqueues.

The number of threads is always at least a settable minimum, but more are
started when there's more work to do, up to a limit.  Because of the nature of
the load, it's not suitable for a 1-thread-per-CPU type pool.  A system with
one CPU may well want several threads.

This is used by FS-Cache to do slow caching operations in the background, such
as looking up, creating or deleting cache objects.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
2009-04-03 16:42:35 +01:00
Li Zefan
e2494e1b42 blktrace: fix pdu_len when tracing packet command requests
Impact: output all of packet commands - not just the first 4 / 8 bytes

Since commit d7e3c3249e ("block: add
large command support"), struct request->cmd has been changed from
unsinged char cmd[BLK_MAX_CDB] to unsigned char *cmd.

v1 -> v2: by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>

- make sure rq->cmd_len is always intialized, and then we can use
  rq->cmd_len instead of BLK_MAX_CDB.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
LKML-Reference: <49D4507E.2060602@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03 15:29:26 +02:00
Li Zefan
7635b03adf blktrace: small cleanup in blk_msg_write()
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "Alan D. Brunelle" <alan.brunelle@hp.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
LKML-Reference: <49D5BB56.7000807@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03 14:48:11 +02:00
Carl Henrik Lunde
a4b3ada83d blktrace: NUL-terminate user space messages
Impact: fix corrupted blkparse output

Make sure messages from user space are NUL-terminated strings,
otherwise we could dump random memory to the block trace file.

Additionally, I've limited the message to BLK_TN_MAX_MSG-1
characters, because the last character would be stripped by
vscnprintf anyway.

Signed-off-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "Alan D. Brunelle" <alan.brunelle@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20090403122714.GT5178@kernel.dk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03 14:46:22 +02:00
Ingo Molnar
c826e3cd0c kmemtrace: small cleanups
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
LKML-Reference: <161be9ca8a27b432c4a6ab79f47788c4521652ae.1237813499.git.eduard.munteanu@linux360.ro>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03 12:23:09 +02:00
Eduard - Gabriel Munteanu
42af9054c0 kmemtrace: restore original tracing data binary format, improve ABI
When kmemtrace was ported to ftrace, the marker strings were taken as
an indication of how the traced data was being exposed to the userspace.
However, the actual format had been binary, not text.

This restores the original binary format, while also adding an origin CPU
field (since ftrace doesn't expose the data per-CPU to userspace), and
re-adding the timestamp field. It also drops arch-independent field
sizing where it didn't make sense, so pointers won't always be 64 bits
wide like they used to.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
LKML-Reference: <161be9ca8a27b432c4a6ab79f47788c4521652ae.1237813499.git.eduard.munteanu@linux360.ro>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03 12:23:08 +02:00
Eduard - Gabriel Munteanu
da2635a985 kmemtrace: kmemtrace_alloc() must fill type_id
Impact: fix trace output

kmemtrace_alloc() was not filling type_id, which allowed garbage to make
it into tracing data.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
LKML-Reference: <284dba2732a144849d5aa82258fe0de2ad8dcb0b.1237813499.git.eduard.munteanu@linux360.ro>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03 12:23:07 +02:00
Eduard - Gabriel Munteanu
ca2b84cb3c kmemtrace: use tracepoints
kmemtrace now uses tracepoints instead of markers. We no longer need to
use format specifiers to pass arguments.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
[ folded: Use the new TP_PROTO and TP_ARGS to fix the build.     ]
[ folded: fix build when CONFIG_KMEMTRACE is disabled.           ]
[ folded: define tracepoints when CONFIG_TRACEPOINTS is enabled. ]
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <ae61c0f37156db8ec8dc0d5778018edde60a92e3.1237813499.git.eduard.munteanu@linux360.ro>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03 12:23:06 +02:00
Ingo Molnar
a979241c53 kmemtrace, rcu: fix rcupreempt.c data structure dependencies
Impact: cleanup

We want to remove percpu.h from rcupreempt.h, but if we do that
the percpu primitives there wont build anymore. Move them to the
.c file instead.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: paulmck@linux.vnet.ibm.com
LKML-Reference: <1237898630.25315.83.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03 12:23:04 +02:00
Ingo Molnar
6258c4fb59 kmemtrace, rcu: fix rcu_tree_trace.c data structure dependencies
Impact: cleanup

We want to remove rcutree internals from the public rcutree.h file for
upcoming kmemtrace changes - but kernel/rcutree_trace.c depends on them.

Introduce kernel/rcutree.h for internal definitions. (Probably all
the other data types from include/linux/rcutree.h could be
moved here too - except rcu_data.)

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: paulmck@linux.vnet.ibm.com
LKML-Reference: <1237898630.25315.83.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03 12:23:03 +02:00
Ingo Molnar
b1f77b0581 kmemtrace, rcu: fix linux/rcutree.h and linux/rcuclassic.h dependencies
Impact: build fix for all non-x86 architectures

We want to remove percpu.h from rcuclassic.h/rcutree.h (for upcoming
kmemtrace changes) but that would break the DECLARE_PER_CPU based
declarations in these files.

Move the quiescent counter management functions to their respective
RCU implementation .c files - they were slightly above the inlining
limit anyway.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: paulmck@linux.vnet.ibm.com
LKML-Reference: <1237898630.25315.83.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-03 12:23:02 +02:00
Linus Torvalds
8fe74cf053 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  Remove two unneeded exports and make two symbols static in fs/mpage.c
  Cleanup after commit 585d3bc06f
  Trim includes of fdtable.h
  Don't crap into descriptor table in binfmt_som
  Trim includes in binfmt_elf
  Don't mess with descriptor table in load_elf_binary()
  Get rid of indirect include of fs_struct.h
  New helper - current_umask()
  check_unsafe_exec() doesn't care about signal handlers sharing
  New locking/refcounting for fs_struct
  Take fs_struct handling to new file (fs/fs_struct.c)
  Get rid of bumping fs_struct refcount in pivot_root(2)
  Kill unsharing fs_struct in __set_personality()
2009-04-02 21:09:10 -07:00
Robin Holt
f5f7eac41d Allow rwlocks to re-enable interrupts
Pass the original flags to rwlock arch-code, so that it can re-enable
interrupts if implemented for that architecture.

Initially, make __raw_read_lock_flags and __raw_write_lock_flags stubs
which just do the same thing as non-flags variants.

Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Robin Holt <holt@sgi.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-arch@vger.kernel.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:11 -07:00
Robin Holt
e8c158bb31 Factor out #ifdefs from kernel/spinlock.c to LOCK_CONTENDED_FLAGS
SGI has observed that on large systems, interrupts are not serviced for a
long period of time when waiting for a rwlock.  The following patch series
re-enables irqs while waiting for the lock, resembling the code which is
already there for spinlocks.

I only made the ia64 version, because the patch adds some overhead to the
fast path.  I assume there is currently no demand to have this for other
architectures, because the systems are not so large.  Of course, the
possibility to implement raw_{read|write}_lock_flags for any architecture
is still there.

This patch:

The new macro LOCK_CONTENDED_FLAGS expands to the correct implementation
depending on the config options, so that IRQ's are re-enabled when
possible, but they remain disabled if CONFIG_LOCKDEP is set.

Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: <linux-arch@vger.kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:10 -07:00
Aravind Srinivasan
2c53d9109f relay: fix for possible loss/corruption of produced subbufs
Fix possible loss/corruption of produced subbufs in
relay_subbufs_consumed().

When buf->subbufs_produced wraps around after UINT_MAX and
buf->subbufs_consumed is still < UINT_MAX, the condition

	if (buf->subbufs_consumed > buf->subbufs_produced)

will be true even for certain valid values of subbufs_consumed.  This may
lead to loss or corruption of produced subbufs.

Signed-off-by: Aravind Srinivasan <raa.aars@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:05 -07:00
Dmitri Vorobiev
edb79a2132 kexec: vmcoreinfo_data[] can become static
The vmcoreinfo_data[] array is not used outside of kernel/kexec.c, and
can therefore become static. This patch adds the relevant keyword to the
definition of the array.

Noticed by sparse.

Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:04 -07:00
Neil Horman
04d491ab2a kexec: add dmesg log symbols to /proc/vmcoreinfo lists
It would be nice to be able to extract the dmesg log from a vmcore file
without needing to keep the debug symbols for the running kernel handy all
the time.  We have a facility to do this in /proc/vmcore.  This patch adds
the log_buf and log_end symbols to the vmcoreinfo area so that tools (like
makedumpfile) can easily extract the dmesg logs from a vmcore image.

[akpm@linux-foundation.org: several fixes and cleanups]
[akpm@linux-foundation.org: fix unused log_buf_kexec_setup()]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Simon Horman <horms@verge.net.au>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:04 -07:00
Oleg Nesterov
1b0f7ffd0e pids: kill signal_struct-> __pgrp/__session and friends
We are wasting 2 words in signal_struct without any reason to implement
task_pgrp_nr() and task_session_nr().

task_session_nr() has no callers since
2e2ba22ea4, we can remove it.

task_pgrp_nr() is still (I believe wrongly) used in fs/autofsX and
fs/coda.

This patch reimplements task_pgrp_nr() via task_pgrp_nr_ns(), and kills
__pgrp/__session and the related helpers.

The change in drivers/char/tty_io.c is cosmetic, but hopefully makes sense
anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Alan Cox <number6@the-village.bc.nu>		[tty parts]
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:02 -07:00
Oleg Nesterov
52ee2dfdd4 pids: refactor vnr/nr_ns helpers to make them safe
Inho, the safety rules for vnr/nr_ns helpers are horrible and buggy.

task_pid_nr_ns(task) needs rcu/tasklist depending on task == current.

As for "special" pids, vnr/nr_ns helpers always need rcu.  However, if
task != current, they are unsafe even under rcu lock, we can't trust
task->group_leader without the special checks.

And almost every helper has a callsite which needs a fix.

Also, it is a bit annoying that the implementations of, say,
task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical".

This patch introduces the new helper, __task_pid_nr_ns(), which is always
safe to use, and turns all other helpers into the trivial wrappers.

After this I'll send another patch which converts task_tgid_xxx() as well,
they're are a bit special.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:02 -07:00
Oleg Nesterov
2ae448efc8 pids: improve get_task_pid() to fix the unsafe sys_wait4()->task_pgrp()
sys_wait4() does get_pid(task_pgrp(current)), this is not safe.  We can
add rcu lock/unlock around, but we already have get_task_pid() which can
be improved to handle the special pids in more reliable manner.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:02 -07:00
Matthew Wilcox
8e654fba4a sysctl: fix suid_dumpable and lease-break-time sysctls
Arne de Bruijn points out that commit
76fdbb25f9 ("coredump masking: bound
suid_dumpable sysctl") mistakenly limits lease-break-time instead of
suid_dumpable.

Signed-off-by: Matthew Wilcox <matthew@wil.cx>
Reported-by: Arne de Bruijn <kernelbt@arbruijn.dds.nl>
Cc: Kawai, Hidehiro <hidehiro.kawai.ez@hitachi.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:01 -07:00
Serge E. Hallyn
11dea19009 proc_sysctl: use CONFIG_PROC_SYSCTL around ipc and utsname proc_handlers
As pointed out by Cedric Le Goater (in response to Alexey's original
comment wrt mqns), ipc_sysctl.c and utsname_sysctl.c are using
CONFIG_PROC_FS, not CONFIG_PROC_SYSCTL, to determine whether to define
the proc_handlers.  Change that.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:01 -07:00
Lai Jiangshan
2355b70fd5 workqueue: avoid recursion in run_workqueue()
1) lockdep will complain when run_workqueue() performs recursion.

2) The recursive implementation of run_workqueue() means that
   flush_workqueue() and its documentation are inconsistent.  This may
   hide deadlocks and other bugs.

3) The recursion in run_workqueue() will poison cwq->current_work, but
   flush_work() and __cancel_work_timer(), etcetera need a reliable
   cwq->current_work.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:00 -07:00
Oleg Nesterov
1ee1184485 ptrace_untrace: fix the SIGNAL_STOP_STOPPED check
This bug is ancient too. ptrace_untrace() must not resume the task
if the group stop in progress, we should set TASK_STOPPED instead.

Unfortunately, we still have problems here:

	- if the process/thread was traced, SIGNAL_STOP_STOPPED
	  does not necessary means this thread group is stopped.

	- ptrace breaks the bookkeeping of ->group_stop_count.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:00 -07:00
Oleg Nesterov
95a3540da9 ptrace_detach: the wrong wakeup breaks the ERESTARTxxx logic
Another ancient bug. Consider this trivial test-case,

	int main(void)
	{
		int pid = fork();

		if (pid) {
			ptrace(PTRACE_ATTACH, pid, NULL, NULL);
			wait(NULL);
			ptrace(PTRACE_DETACH, pid, NULL, NULL);
		} else {
			pause();
			printf("WE HAVE A KERNEL BUG!!!\n");
		}

		return 0;
	}

the child must not "escape" for sys_pause(), but it can and this was seen
in practice.

This is because ptrace_detach does:

	if (!child->exit_state)
		wake_up_process(child);

this wakeup can happen after this child has already restarted sys_pause(),
because it gets another wakeup from ptrace_untrace().

With or without this patch, perhaps sys_pause() needs a fix.  But this
wakeup also breaks the SIGNAL_STOP_STOPPED logic in ptrace_untrace().

Remove this wakeup.  The caller saw this task in TASK_TRACED state, and
unless it was SIGKILL'ed in between __ptrace_unlink()->ptrace_untrace()
should handle this case correctly.  If it was SIGKILL'ed, we don't need to
wakup the dying tracee too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:00 -07:00
Oleg Nesterov
5dfc80be73 forget_original_parent: do not abuse child->ptrace_entry
By discussion with Roland.

- Use ->sibling instead of ->ptrace_entry to chain the need to be
  release_task'd childs. Nobody else can use ->sibling, this task
  is EXIT_DEAD and nobody can find it on its own list.

- rename ptrace_dead to dead_childs.

- Now that we don't have the "parallel" untrace code, change back
  reparent_thread() to return void, pass dead_childs as an argument.

Actually, I don't understand why do we notify /sbin/init when we
reparent a zombie, probably it is better to reap it unconditionally.

[akpm@linux-foundation.org: s/childs/children/]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Metzger, Markus T" <markus.t.metzger@intel.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:00 -07:00
Oleg Nesterov
39c626ae47 forget_original_parent: split out the un-ptrace part
By discussion with Roland.

- Rename ptrace_exit() to exit_ptrace(), and change it to do all the
  necessary work with ->ptraced list by its own.

- Move this code from exit.c to ptrace.c

- Update the comment in ptrace_detach() to explain the rechecking of
  the child->ptrace.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Metzger, Markus T" <markus.t.metzger@intel.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:05:00 -07:00
Oleg Nesterov
7f5d3652d4 reparent_thread: fix a zombie leak if /sbin/init ignores SIGCHLD
If /sbin/init ignores SIGCHLD and we re-parent a zombie, it is leaked.
reparent_thread() does do_notify_parent() which sets ->exit_signal = -1 in
this case.  This means that nobody except us can reap it, the detached
task is not visible to do_wait().

Change reparent_thread() to return a boolean (like __pthread_detach) to
indicate that the thread is dead and must be released.  Also change
forget_original_parent() to add the child to ptrace_dead list in this
case.

The naming becomes insane, the next patch does the cleanup.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:59 -07:00
Oleg Nesterov
b1442b055c reparent_thread: fix the "is it traced" check
reparent_thread() uses ptrace_reparented() to check whether this thread is
ptraced, in that case we should not notify the new parent.

But ptrace_reparented() is not exactly correct when the reparented thread
is traced by /sbin/init, because forget_original_parent() has already
changed ->real_parent.

Currently, the only problem is the false notification.  But with the next
patch the kernel crash in this (yes, pathological) case.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:59 -07:00
Oleg Nesterov
0a967a044a reparent_thread: don't call kill_orphaned_pgrp() if task_detached()
If task_detached(p) == T, then either

  a) p is not the main thread, we will find the group leader on the
     ->children list.

or

  b) p is the group leader but its ->exit_state = EXIT_DEAD.  This
     can only happen when the last sub-thread has died, but in that case
     that thread has already called kill_orphaned_pgrp() from
     exit_notify().

In both cases kill_orphaned_pgrp() looks bogus.

Move the task_detached() check up and simplify the code, this is also
right from the "common sense" pov: we should do nothing with the detached
childs, except move them to the new parent's ->children list.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:59 -07:00
Oleg Nesterov
4576145c1e ptrace: fix possible zombie leak on PTRACE_DETACH
When ptrace_detach() takes tasklist, the tracee can be SIGKILL'ed.  If it
has already passed exit_notify() we can leak a zombie, because a) ptracing
disables the auto-reaping logic, and b) ->real_parent was not notified
about the child's death.

ptrace_detach() should follow the ptrace_exit's logic, change the code
accordingly.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Tested-by: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:59 -07:00
Oleg Nesterov
b1b4c6799f ptrace: reintroduce __ptrace_detach() as a callee of ptrace_exit()
No functional changes, preparation for the next patch.

Move the "should we release this child" logic into the separate handler,
__ptrace_detach().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:59 -07:00
Oleg Nesterov
6d69cb87f0 ptrace: simplify ptrace_exit()->ignoring_children() path
ignoring_children() takes parent->sighand->siglock and checks
k_sigaction[SIGCHLD] atomically.  But this buys nothing, we can't get the
"really" wrong result even if we race with sigaction(SIGCHLD).  If we read
the "stale" sa_handler/sa_flags we can pretend it was changed right after
the check.

Remove spin_lock(->siglock), and kill "int ign" which caches the result of
ignoring_children() which becomes rather trivial.

Perhaps it makes sense to export this helper, do_notify_parent() can use
it too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:59 -07:00
Oleg Nesterov
95c3eb76dc ptrace: kill __ptrace_detach(), fix ->exit_state check
Move the code from __ptrace_detach() to its single caller and kill this
helper.

Also, fix the ->exit_state check, we shouldn't wake up EXIT_DEAD tasks.
Actually, I think task_is_stopped_or_traced() makes more sense, but this
needs another patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:59 -07:00
Sukadev Bhattiprolu
6588c1e3ff signals: SI_USER: Masquerade si_pid when crossing pid ns boundary
When sending a signal to a descendant namespace, set ->si_pid to 0 since
the sender does not have a pid in the receiver's namespace.

Note:
	- If rt_sigqueueinfo() sets si_code to SI_USER when sending a
	  signal across a pid namespace boundary, the value in ->si_pid
	  will be cleared to 0.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:58 -07:00
Sukadev Bhattiprolu
b3bfa0cba8 signals: protect cinit from blocked fatal signals
Normally SIG_DFL signals to global and container-init are dropped early.
But if a signal is blocked when it is posted, we cannot drop the signal
since the receiver may install a handler before unblocking the signal.
Once this signal is queued however, the receiver container-init has no way
of knowing if the signal was sent from an ancestor or descendant
namespace.  This patch ensures that contianer-init drops all SIG_DFL
signals in get_signal_to_deliver() except SIGKILL/SIGSTOP.

If SIGSTOP/SIGKILL originate from a descendant of container-init they are
never queued (i.e dropped in sig_ignored() in an earler patch).

If SIGSTOP/SIGKILL originate from parent namespace, the signal is queued
and container-init processes the signal.

IOW, if get_signal_to_deliver() sees a sig_kernel_only() signal for global
or container-init, the signal must have been generated internally or must
have come from an ancestor ns and we process the signal.

Further, the signal_group_exit() check was needed to cover the case of a
multi-threaded init sending SIGKILL to other threads when doing an exit()
or exec().  But since the new sig_kernel_only() check covers the SIGKILL,
the signal_group_exit() check is no longer needed and can be removed.

Finally, now that we have all pieces in place, set SIGNAL_UNKILLABLE for
container-inits.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:58 -07:00
Sukadev Bhattiprolu
e4da026f98 signals: zap_pid_ns_process() should use force_sig()
send_signal() assumes that signals with SEND_SIG_PRIV are generated from
within the same namespace.  So any nested container-init processes become
immune to the SIGKILL generated by kill_proc_info() in
zap_pid_ns_processes().

Use force_sig() in zap_pid_ns_processes() instead - force_sig() clears the
SIGNAL_UNKILLABLE flag ensuring the signal is processed by
container-inits.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:58 -07:00
Sukadev Bhattiprolu
921cf9f630 signals: protect cinit from unblocked SIG_DFL signals
Drop early any SIG_DFL or SIG_IGN signals to container-init from within
the same container.  But queue SIGSTOP and SIGKILL to the container-init
if they are from an ancestor container.

Blocked, fatal signals (i.e when SIG_DFL is to terminate) from within the
container can still terminate the container-init.  That will be addressed
in the next patch.

Note:	To be bisect-safe, SIGNAL_UNKILLABLE will be set for container-inits
   	in a follow-on patch. Until then, this patch is just a preparatory
	step.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:58 -07:00
Sukadev Bhattiprolu
7978b567d3 signals: add from_ancestor_ns parameter to send_signal()
send_signal() (or its helper) needs to determine the pid namespace of the
sender.  But a signal sent via kill_pid_info_as_uid() comes from within
the kernel and send_signal() does not need to determine the pid namespace
of the sender.  So define a helper for send_signal() which takes an
additional parameter, 'from_ancestor_ns' and have kill_pid_info_as_uid()
use that helper directly.

The 'from_ancestor_ns' parameter will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:58 -07:00
Oleg Nesterov
f008faff0e signals: protect init from unwanted signals more
(This is a modified version of the patch submitted by Oleg Nesterov
http://lkml.org/lkml/2008/11/18/249 and tries to address comments that
came up in that discussion)

init ignores the SIG_DFL signals but we queue them anyway, including
SIGKILL.  This is mostly OK, the signal will be dropped silently when
dequeued, but the pending SIGKILL has 2 bad implications:

        - it implies fatal_signal_pending(), so we confuse things
          like wait_for_completion_killable/lock_page_killable.

        - for the sub-namespace inits, the pending SIGKILL can
          mask (legacy_queue) the subsequent SIGKILL from the
          parent namespace which must kill cinit reliably.
          (preparation, cinits don't have SIGNAL_UNKILLABLE yet)

The patch can't help when init is ptraced, but ptracing of init is not
"safe" anyway.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:58 -07:00
Oleg Nesterov
43918f2bf4 signals: remove 'handler' parameter to tracehook functions
Container-init must behave like global-init to processes within the
container and hence it must be immune to unhandled fatal signals from
within the container (i.e SIG_DFL signals that terminate the process).

But the same container-init must behave like a normal process to processes
in ancestor namespaces and so if it receives the same fatal signal from a
process in ancestor namespace, the signal must be processed.

Implementing these semantics requires that send_signal() determine pid
namespace of the sender but since signals can originate from workqueues/
interrupt-handlers, determining pid namespace of sender may not always be
possible or safe.

This patchset implements the design/simplified semantics suggested by
Oleg Nesterov.  The simplified semantics for container-init are:

	- container-init must never be terminated by a signal from a
	  descendant process.

	- container-init must never be immune to SIGKILL from an ancestor
	  namespace (so a process in parent namespace must always be able
	  to terminate a descendant container).

	- container-init may be immune to unhandled fatal signals (like
	  SIGUSR1) even if they are from ancestor namespace. SIGKILL/SIGSTOP
	  are the only reliable signals to a container-init from ancestor
	  namespace.

This patch:

Based on an earlier patch submitted by Oleg Nesterov and comments from
Roland McGrath (http://lkml.org/lkml/2008/11/19/258).

The handler parameter is currently unused in the tracehook functions.
Besides, the tracehook functions are called with siglock held, so the
functions can check the handler if they later need to.

Removing the parameter simiplifies changes to sig_ignored() in a follow-on
patch.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:58 -07:00
Oleg Nesterov
90bc8d8b1a do_wait: fix waiting for the group stop with the dead leader
do_wait(WSTOPPED) assumes that p->state must be == TASK_STOPPED, this is
not true if the leader is already dead.  Check SIGNAL_STOP_STOPPED instead
and use signal->group_exit_code.

Trivial test-case:

	void *tfunc(void *arg)
	{
		pause();
		return NULL;
	}

	int main(void)
	{
		pthread_t thr;
		pthread_create(&thr, NULL, tfunc, NULL);
		pthread_exit(NULL);
		return 0;
	}

It doesn't react to ^Z (and then to ^C or ^\). The task is stopped, but
bash can't see this.

The bug is very old, and it was reported multiple times. This patch was sent
more than a year ago (http://marc.info/?t=119713920000003) but it was ignored.

This change also fixes other oddities (but not all) in this area.  For
example, before this patch:

	$ sleep 100
	^Z
	[1]+  Stopped                 sleep 100
	$ strace -p `pidof sleep`
	Process 11442 attached - interrupt to quit

strace hangs in do_wait(), because ->exit_code was already consumed by
bash.  After this patch, strace happily proceeds:

	--- SIGTSTP (Stopped) @ 0 (0) ---
	restart_syscall(<... resuming interrupted call ...>

To me, this looks much more "natural" and correct.

Another example.  Let's suppose we have the main thread M and sub-thread
T, the process is stopped, and its parent did wait(WSTOPPED).  Now we can
ptrace T but not M.  This looks at least strange to me.

Imho, do_wait() should not confuse the per-thread ptrace stops with the
per-process job control stops.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Kaz Kylheku <kkylheku@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:57 -07:00
David Rientjes
6d7b2f5f9e cpusets: prevent PF_THREAD_BOUND tasks from attaching to non-root cpusets
Kthreads that have the PF_THREAD_BOUND bit set in their flags are bound to a
specific cpu.  Thus, their set of allowed cpus shall not change.

This patch prevents such threads from attaching to non-root cpusets.  They do
not have mempolicies that restrict them to a subset of system nodes and, since
their cpumask may never change, they cannot use any of the features of
cpusets.

The tasks will forever be a member of the root cpuset and will be returned
when listing the tasks attached to that cpuset.

Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:57 -07:00
Paul Menage
db7f47cf48 cpusets: allow cpusets to be configured/built on non-SMP systems
Allow cpusets to be configured/built on non-SMP systems

Currently it's impossible to build cpusets under UML on x86-64, since
cpusets depends on SMP and x86-64 UML doesn't support SMP.

There's code in cpusets that doesn't depend on SMP.  This patch surrounds
the minimum amount of cpusets code with #ifdef CONFIG_SMP in order to
allow cpusets to build/run on UP systems (for testing purposes under UML).

Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:57 -07:00
David Rientjes
a1bc5a4eee cpusets: replace zone allowed functions with node allowed
The cpuset_zone_allowed() variants are actually only a function of the
zone's node.

Cc: Paul Menage <menage@google.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:57 -07:00
Li Zefan
7f81b1ae18 cpuset: remove struct cpuset_hotplug_scanner
Use cgroup_scanner.data, instead of introducing cpuset_hotplug_scanner.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:57 -07:00
Li Zefan
010cfac4ca cpuset: avoid changing cpuset's mems when errno returned
When writing to cpuset.mems, cpuset has to update its mems_allowed before
calling update_tasks_nodemask(), but this function might return -ENOMEM.

To avoid this rare case, we allocate the memory before changing
mems_allowed, and then pass to update_tasks_nodemask().  Similar to what
update_cpumask() does.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:57 -07:00
Li Zefan
3b6766fe66 cpuset: rewrite update_tasks_nodemask()
This patch uses cgroup_scan_tasks() to rebind tasks' vmas to new cpuset's
mems_allowed.

Not only simplify the code largely, but also avoid allocating an array to
hold mm pointers of all the tasks in the cpuset.  This array can be big
(size > PAGESIZE) if we have lots of tasks in that cpuset, thus has a
chance to fail the allocation when under memory stress.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:57 -07:00
Li Zefan
0b4217b3fd cpuset: fix possible races in cpu/memory hotplug
Change to cpuset->cpus_allowed and cpuset->mems_allowed should be protected
by callback_mutex, otherwise the reader may read wrong cpus/mems. This is
cpuset's lock rule.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:56 -07:00
KAMEZAWA Hiroyuki
0b7f569e45 memcg: fix OOM killer under memcg
This patch tries to fix OOM Killer problems caused by hierarchy.
Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to
kill a task in memcg.

But, when hierarchy is used, it's broken and correct task cannot
be killed. For example, in following cgroup

	/groupA/	hierarchy=1, limit=1G,
		01	nolimit
		02	nolimit
All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to
groupA's 1Gbytes but OOM Killer just kills tasks in groupA.

This patch provides makes the bad process be selected from all tasks
under hierarchy. BTW, currently, oom_jiffies is updated against groupA
in above case. oom_jiffies of tree should be updated.

To see how oom_jiffies is used, please check mem_cgroup_oom_called()
callers.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: const fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:55 -07:00
Li Zefan
d969fbe69e debug cgroup: remove unneeded cgroup_lock
Since we are in cgroup write handler, so the cgrp is valid, so we don't
have to hold cgroup_mutex when calling cgroup_task_count().  One similar
example is in cgroup_tasks_open().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:54 -07:00
Li Zefan
0670e08bdf cgroups: don't change release_agent when remount failed
Remount can fail in either case:
  - wrong mount options is specified, or option 'noprefix' is changed.
  - a to-be-added subsys is already mounted/active.

When using remount to change 'release_agent', for the above former failure
case, remount will return errno with release_agent unchanged, but for the
latter case, remount will return EBUSY with relase_agent changed, which is
unexpected I think:

 # mount -t cgroup -o cpu xxx /cgrp1
 # mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2
 # cat /cgrp2/release_agent
 agent1
 # mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2
 mount: /cgrp2 not mounted already, or bad option
 # cat /cgrp2/release_agent
 agent1     <-- ok
 # mount -t cgroup -o remount,cpu,cpuset,release_agent=agent2 yyy /cgrp2
 mount: /cgrp2 is busy
 # cat /cgrp2/release_agent
 agent2     <-- unexpected!

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:54 -07:00
Li Zefan
099fca3225 cgroups: show correct file mode
We have some read-only files and write-only files, but currently they are
all set to 0644, which is counter-intuitive and cause trouble for some
cgroup tools like libcgroup.

This patch adds 'mode' to struct cftype to allow cgroup subsys to set it's
own files' file mode, and for the most cases cft->mode can be default to 0
and cgroup will figure out proper mode.

Acked-by: Paul Menage <menage@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:54 -07:00
Jesper Juhl
66bdc9cfc7 kernel/cgroup.c: kfree(NULL) is legal
Reduces object file size a bit:

Before:
$ size kernel/cgroup.o
   text    data     bss     dec     hex filename
  21593    7804    4924   34321    8611 kernel/cgroup.o
After:
$ size kernel/cgroup.o
   text    data     bss     dec     hex filename
  21537    7744    4924   34205    859d kernel/cgroup.o

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:54 -07:00
KAMEZAWA Hiroyuki
ec64f51545 cgroup: fix frequent -EBUSY at rmdir
In following situation, with memory subsystem,

	/groupA use_hierarchy==1
		/01 some tasks
		/02 some tasks
		/03 some tasks
		/04 empty

When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim
is triggered and the kernel walks tree under groupA. In this case,
rmdir /groupA/04 fails with -EBUSY frequently because of temporal
refcnt from the kernel.

In general. cgroup can be rmdir'd if there are no children groups and
no tasks. Frequent fails of rmdir() is not useful to users.
(And the reason for -EBUSY is unknown to users.....in most cases)

This patch tries to modify above behavior, by
	- retries if css_refcnt is got by someone.
	- add "return value" to pre_destroy() and allows subsystem to
	  say "we're really busy!"

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:54 -07:00
KAMEZAWA Hiroyuki
38460b48d0 cgroup: CSS ID support
Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.

This patch attaches unique ID to each css and provides following.

 - css_lookup(subsys, id)
   returns pointer to struct cgroup_subysys_state of id.
 - css_get_next(subsys, id, rootid, depth, foundid)
   returns the next css under "root" by scanning

When cgroup_subsys->use_id is set, an id for css is maintained.

The cgroup framework only parepares
	- css_id of root css for subsys
	- id is automatically attached at creation of css.
	- id is *not* freed automatically. Because the cgroup framework
	  don't know lifetime of cgroup_subsys_state.
	  free_css_id() function is provided. This must be called by subsys.

There are several reasons to develop this.
	- Saving space .... For example, memcg's swap_cgroup is array of
	  pointers to cgroup. But it is not necessary to be very fast.
	  By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
	  reduce much amount of memory usage.

	- Scanning without lock.
	  CSS_ID provides "scan id under this ROOT" function. By this, scanning
	  css under root can be written without locks.
	  ex)
	  do {
		rcu_read_lock();
		next = cgroup_get_next(subsys, id, root, &found);
		/* check sanity of next here */
		css_tryget();
		rcu_read_unlock();
		id = found + 1
	 } while(...)

Characteristics:
	- Each css has unique ID under subsys.
	- Lifetime of ID is controlled by subsys.
	- css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
	- Allowed ID is 1-65535, ID 0 is UNUSED ID.

Design Choices:
	- scan-by-ID v.s. scan-by-tree-walk.
	  As /proc's pid scan does, scan-by-ID is robust when scanning is done
	  by following kind of routine.
	  scan -> rest a while(release a lock) -> conitunue from interrupted
	  memcg's hierarchical reclaim does this.

	- When subsys->use_id is set, # of css in the system is limited to
	  65535.

[bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:53 -07:00
Grzegorz Nosek
313e924c08 cgroups: relax ns_can_attach checks to allow attaching to grandchild cgroups
The ns_proxy cgroup allows moving processes to child cgroups only one
level deep at a time.  This commit relaxes this restriction and makes it
possible to attach tasks directly to grandchild cgroups, e.g.:

($pid is in the root cgroup)
echo $pid > /cgroup/CG1/CG2/tasks

Previously this operation would fail with -EPERM and would have to be
performed as two steps:
echo $pid > /cgroup/CG1/tasks
echo $pid > /cgroup/CG1/CG2/tasks

Also, the target cgroup no longer needs to be empty to move a task there.

Signed-off-by: Grzegorz Nosek <root@localdomain.pl>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:53 -07:00
Alexey Dobriyan
6f2c55b843 Simplify copy_thread()
First argument unused since 2.3.11.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:51 -07:00
David Howells
33e5d76979 nommu: fix a number of issues with the per-MM VMA patch
Fix a number of issues with the per-MM VMA patch:

 (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
     a NOMMU system with more than 2G pages.  Makes no difference on a 32-bit
     system.

 (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
     lest it overflow.

 (3) Move the allocation of the vm_area_struct slab back for fork.c.

 (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.

 (5) Use BUG_ON() rather than if () BUG().

 (6) Make the default validate_nommu_regions() a static inline rather than a
     #define.

 (7) Make free_page_series()'s objection to pages with a refcount != 1 more
     informative.

 (8) Adjust the __put_nommu_region() banner comment to indicate that the
     semaphore must be held for writing.

 (9) Limit the number of warnings about munmaps of non-mmapped regions.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 19:04:48 -07:00
Kees Cook
3d43321b70 modules: sysctl to block module loading
Implement a sysctl file that disables module-loading system-wide since
there is no longer a viable way to remove CAP_SYS_MODULE after the system
bounding capability set was removed in 2.6.25.

Value can only be set to "1", and is tested only if standard capability
checks allow CAP_SYS_MODULE.  Given existing /dev/mem protections, this
should allow administrators a one-way method to block module loading
after initial boot-time module loading has finished.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-04-03 11:47:11 +11:00
Darren Hart
cd84a42f31 futex: comment requeue key reference semantics
We've tripped over the futex_requeue drop_count refering to key2
instead of key1.  The code is actually correct, but is non-intuitive.
This patch adds an explicit comment explaining the requeue.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-02 23:39:53 +02:00
Randy Dunlap
633fe795b8 timers: add missing kernel-doc
Add missing kernel-doc parameter notation and change function
name to its new name:

  Warning(kernel/timer.c:543): No description found for parameter 'name'
  Warning(kernel/timer.c:543): No description found for parameter 'key'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: akpm <akpm@linux-foundation.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
LKML-Reference: <20090401174723.f0bea0eb.randy.dunlap@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-02 05:54:00 +02:00
Ingo Molnar
8302294f43 Merge branch 'tracing/core-v2' into tracing-for-linus
Conflicts:
	include/linux/slub_def.h
	lib/Kconfig.debug
	mm/slob.c
	mm/slub.c
2009-04-02 00:49:02 +02:00
Davide Libenzi
4ede816ac3 epoll keyed wakeups: add __wake_up_locked_key() and __wake_up_sync_key()
This patchset introduces wakeup hints for some of the most popular (from
epoll POV) devices, so that epoll code can avoid spurious wakeups on its
waiters.

The problem with epoll is that the callback-based wakeups do not, ATM,
carry any information about the events the wakeup is related to.  So the
only choice epoll has (not being able to call f_op->poll() from inside the
callback), is to add the file* to a ready-list and resolve the real events
later on, at epoll_wait() (or its own f_op->poll()) time.  This can cause
spurious wakeups, since the wake_up() itself might be for an event the
caller is not interested into.

The rate of these spurious wakeup can be pretty high in case of many
network sockets being monitored.

By allowing devices to report the events the wakeups refer to (at least
the two major classes - POLLIN/POLLOUT), we are able to spare useless
wakeups by proper handling inside the epoll's poll callback.

Epoll will have in any case to call f_op->poll() on the file* later on,
since the change to be done in order to have the full event set sent via
wakeup, is too invasive for the way our f_op->poll() system works (the
full event set is calculated inside the poll function - there are too many
of them to even start thinking the change - also poll/select would need
change too).

Epoll is changed in a way that both devices which send event hints, and
the ones that don't, are correctly handled.  The former will gain some
efficiency though.

As a general rule for devices, would be to add an event mask by using
key-aware wakeup macros, when making up poll wait queues.  I tested it
(together with the epoll's poll fix patch Andrew has in -mm) and wakeups
for the supported devices are correctly filtered.

Test program available here:

http://www.xmailserver.org/epoll_test.c

This patch:

Nothing revolutionary here.  Just using the available "key" that our
wakeup core already support.  The __wake_up_locked_key() was no brainer,
since both __wake_up_locked() and __wake_up_locked_key() are thin wrappers
around __wake_up_common().

The __wake_up_sync() function had a body, so the choice was between
borrowing the body for __wake_up_sync_key() and calling it from
__wake_up_sync(), or make an inline and calling it from both.  I chose the
former since in most archs it all resolves to "mov $0, REG; jmp ADDR".

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: William Lee Irwin III <wli@movementarian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01 08:59:20 -07:00
Magnus Damm
a8af78982f pm: rework includes, remove arch ifdefs
Make the following header file changes:

 - remove arch ifdefs and asm/suspend.h from linux/suspend.h
 - add asm/suspend.h to disk.c (for arch_prepare_suspend())
 - add linux/io.h to swsusp.c (for ioremap())
 - x86 32/64 bit compile fixes

Signed-off-by: Magnus Damm <damm@igel.co.jp>
Cc: Paul Mundt <lethal@linux-sh.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01 08:59:16 -07:00
Alexey Dobriyan
704503d836 mm: fix proc_dointvec_userhz_jiffies "breakage"
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9838

On i386, HZ=1000, jiffies_to_clock_t() converts time in a somewhat strange
way from the user's point of view:

	# echo 500 >/proc/sys/vm/dirty_writeback_centisecs
	# cat /proc/sys/vm/dirty_writeback_centisecs
	499

So, we have 5000 jiffies converted to only 499 clock ticks and reported
back.

TICK_NSEC = 999848
ACTHZ = 256039

Keeping in-kernel variable in units passed from userspace will fix issue
of course, but this probably won't be right for every sysctl.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01 08:59:13 -07:00
KOSAKI Motohiro
ee99c71c59 mm: introduce for_each_populated_zone() macro
Impact: cleanup

In almost cases, for_each_zone() is used with populated_zone().  It's
because almost function doesn't need memoryless node information.
Therefore, for_each_populated_zone() can help to make code simplify.

This patch has no functional change.

[akpm@linux-foundation.org: small cleanup]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01 08:59:11 -07:00
Gautham R Shenoy
46e0bb9c12 sched: Print sched_group::__cpu_power in sched_domain_debug
Impact: extend debug info /proc/sched_debug

If the user changes the value of the sched_mc/smt_power_savings sysfs
tunable, it'll trigger a rebuilding of the whole sched_domain tree,
with the SD_POWERSAVINGS_BALANCE flag set at certain levels.

As a result, there would be a change in the __cpu_power of sched_groups
in the sched_domain hierarchy.

Print the __cpu_power values for each sched_group in sched_domain_debug
to help verify this change and correlate it with the change in the
load-balancing behavior.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090330045520.2869.24777.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-01 17:58:03 +02:00
Bharata B Rao
ef12fefabf cpuacct: add per-cgroup utime/stime statistics
Add per-cgroup cpuacct controller statistics like the system and user
time consumed by the group of tasks.

Changelog:

v7
- Changed the name of the statistic from utime to user and from stime to
  system so that in future we could easily add other statistics like irq,
  softirq, steal times etc easily.

v6
- Fixed a bug in the error path of cpuacct_create() (pointed by Li Zefan).

v5
- In cpuacct_stats_show(), use cputime64_to_clock_t() since we are
  operating on a 64bit variable here.

v4
- Remove comments in cpuacct_update_stats() which explained why rcu_read_lock()
  was needed (as per Peter Zijlstra's review comments).
- Don't say that percpu_counter_read() is broken in Documentation/cpuacct.txt
  as per KAMEZAWA Hiroyuki's review comments.

v3
- Fix a small race in the cpuacct hierarchy walk.

v2
- stime and utime now exported in clock_t units instead of msecs.
- Addressed the code review comments from Balbir and Li Zefan.
- Moved to -tip tree.

v1
- Moved the stime/utime accounting to cpuacct controller.

Earlier versions
- http://lkml.org/lkml/2009/2/25/129

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Balaji Rao <balajirrao@gmail.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
LKML-Reference: <20090331043222.GA4093@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-01 16:49:38 +02:00
Hidetoshi Seto
c5f8d99585 posixtimers, sched: Fix posix clock monotonicity
Impact: Regression fix (against clock_gettime() backwarding bug)

This patch re-introduces a couple of functions, task_sched_runtime
and thread_group_sched_runtime, which was once removed at the
time of 2.6.28-rc1.

These functions protect the sampling of thread/process clock with
rq lock.  This rq lock is required not to update rq->clock during
the sampling.

i.e.
  The clock_gettime() may return
   ((accounted runtime before update) + (delta after update))
  that is less than what it should be.

v2 -> v3:
	- Rename static helper function __task_delta_exec()
	  to do_task_delta_exec() since -tip tree already has
	  a __task_delta_exec() of different version.

v1 -> v2:
	- Revises comments of function and patch description.
	- Add note about accuracy of thread group's runtime.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org	[2.6.28.x][2.6.29.x]
LKML-Reference: <49D1CC93.4080401@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-01 16:44:16 +02:00
Steven Rostedt
2e572895bf ring-buffer: do not remove reader page from list on ring buffer free
Impact: prevent possible memory leak

The reader page of the ring buffer is special. Although it points
into the ring buffer, it is not part of the actual buffer. It is
a page used by the reader to swap with a page in the ring buffer.
Once the swap is made, the new reader page is again outside the
buffer.

Even though the reader page points into the buffer, it is really
pointing to residual data. Note, this data is used by the reader.

              reader page
                  |
                  v
       (prev)   +---+    (next)
     +----------|   |----------+
     |          +---+          |
     v                         v
   +---+        +---+        +---+
-->|   |------->|   |------->|   |--->
<--|   |<-------|   |<-------|   |<---
   +---+        +---+        +---+

     ^            ^            ^
      \           |            /
       ------- Buffer---------

If we perform a list_del_init() on the reader page we will actually remove
the last page the reader swapped with and not the reader page itself.
This will cause that page to not be freed, and thus is a memory leak.

Luckily, the only user of the ring buffer so far is ftrace. And ftrace
will not free its ring buffer after it allocates it. There is no current
possible memory leak. But once there are other users, or if ftrace
dynamically creates and frees its ring buffer, then this would be a
memory leak.

This patch fixes the leak for future cases.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-01 14:47:53 +02:00
Steven Rostedt
2aad1b76e6 function-graph: allow unregistering twice
Impact: fix to permanent disabling of function graph tracer

There should be nothing to prevent a tracer from unregistering a
function graph callback more than once. This can simplify error paths.

But currently, the counter does not account for mulitple unregistering
of the function graph callback. If it happens, the function graph
tracer will be permanently disabled.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-01 14:47:42 +02:00
Rusty Russell
13b8bd0a57 sched_rt: don't allocate cpumask in fastpath
Impact: cleanup

As pointed out by Steven Rostedt.  Since the arg in question is
unused, we simply change cpupri_find() to accept NULL.

Reported-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <200903251501.22664.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-01 13:24:51 +02:00
Al Viro
5ad4e53bd5 Get rid of indirect include of fs_struct.h
Don't pull it in sched.h; very few files actually need it and those
can include directly.  sched.h itself only needs forward declaration
of struct fs_struct;

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-31 23:00:27 -04:00
Al Viro
498052bba5 New locking/refcounting for fs_struct
* all changes of current->fs are done under task_lock and write_lock of
  old fs->lock
* refcount is not atomic anymore (same protection)
* its decrements are done when removing reference from current; at the
  same time we decide whether to free it.
* put_fs_struct() is gone
* new field - ->in_exec.  Set by check_unsafe_exec() if we are trying to do
  execve() and only subthreads share fs_struct.  Cleared when finishing exec
  (success and failure alike).  Makes CLONE_FS fail with -EAGAIN if set.
* check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread
  is in progress.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-31 23:00:26 -04:00
Al Viro
3e93cd6718 Take fs_struct handling to new file (fs/fs_struct.c)
Pure code move; two new helper functions for nfsd and daemonize
(unshare_fs_struct() and daemonize_fs_struct() resp.; for now -
the same code as used to be in callers).  unshare_fs_struct()
exported (for nfsd, as copy_fs_struct()/exit_fs() used to be),
copy_fs_struct() and exit_fs() don't need exports anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-31 23:00:26 -04:00
Al Viro
11d06b2a1e Kill unsharing fs_struct in __set_personality()
That's a rudiment of altroot support.  I.e. it should've been buried
a long time ago.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-31 23:00:25 -04:00
Bharata B Rao
a18b83b7ef cpuacct: make cpuacct hierarchy walk in cpuacct_charge() safe when rcupreempt is used -v2
Impact: fix cgroups race under rcu-preempt

cpuacct_charge() obtains task's ca and does a hierarchy walk upwards.
This can race with the task's movement between cgroups. This race
can cause an access to freed ca pointer in cpuacct_charge() or access
to invalid cgroups pointer of the task. This will not happen with rcu or
tree rcu as cpuacct_charge() is called with preemption disabled. However if
rcupreempt is used, the race is seen. Thanks to Li Zefan for explaining this.

Fix this race by explicitly protecting ca and the hierarchy walk with
rcu_read_lock().

Changes for v2:

 - Update patch descrition (as per Li Zefan's review comments).

 - Remove comments in cpuacct_charge() which explained why rcu_read_lock()
   was needed (as per Peter Zijlstra's review comments).

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 18:27:59 +02:00
Ingo Molnar
8b54e45b00 Merge branches 'tracing/docs', 'tracing/filters', 'tracing/ftrace', 'tracing/kprobes', 'tracing/blktrace-v2' and 'tracing/textedit' into tracing/core-v2 2009-03-31 17:46:40 +02:00
Li Zefan
b14b70a6a4 trace: make argument 'mem' of trace_seq_putmem() const
Impact: fix build warning

I passed a const value to trace_seq_putmem(), and I got compile warning.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 17:45:13 +02:00
Eduard - Gabriel Munteanu
f285901bb2 tracing: add missing 'extern' keywords to trace_output.h
Impact: cleanup

Many declarations within trace_output.h are missing the 'extern' keyword
in an inconsistent manner. This adds 'extern' where it should be.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 17:45:07 +02:00
Eduard - Gabriel Munteanu
bdd6df6af9 tracing: provide trace_seq_reserve()
trace_seq_reserve() allows a caller to reserve space in a trace_seq and
write directly into it. This makes it easier to export binary data to
userspace via the tracing interface, by simply filling in a struct.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 17:45:02 +02:00
Li Zefan
18cea4591a blktrace: print out BLK_TN_MESSAGE properly
Impact: improve ftrace plugin output

Before this patch:

 # cat trace
         make-5383  [001]   741.240059:   8,7    P   N [make]
 __trace_note_message: cfq1074

 # echo 1 > options/blk_classic
 # cat trace
   8,7    1     0.692221252     0  C   W 130411392 + 1024 [0]
 Bad pc action 6361
 Bad pc action 283d

 # echo 0 > options/blk_classic
 # echo bin > trace_options
 # cat trace_pipe | blkparse -i -
 (can't parse messages generated by blk_add_trace_msg())

After this patch:
 # cat trace
      <idle>-0     [001]   187.600933:   8,7    C   W 145220224 + 8 [0]
      <idle>-0     [001]   187.600946:   8,7    m   N cfq1076 complete

 # echo 1 > options/blk_classic
 # cat trace
   8,7    1     0.256378996   238  I   W 113190728 + 8 [pdflush]
   8,7    1     0.256378998   238  m   N cfq1076 insert_request

 # echo 0 > options/blk_classic
 # echo bin > trace_options
 # cat trace_pipe | blkparse -i -
  8,7    1        0    22.973250293     0  C   W 102770576 + 8 [0]
  8,7    1        0    22.973259213     0  m   N cfq1076 complete

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 17:29:11 +02:00
Li Zefan
b6a4b0c3ad blktrace: extract duplidate code
Impact: cleanup

blk_trace_event_print() and blk_tracer_print_line() share most of the code.

   text    data     bss     dec     hex filename
   8605     393      12    9010    2332 kernel/trace/blktrace.o.orig
   text    data     bss     dec     hex filename
   8555     393      12    8960    2300 kernel/trace/blktrace.o

This patch also prepares for the next patch, that prints out BLK_TN_MESSAGE.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 17:28:59 +02:00
Li Zefan
ad5dd5493a blktrace: fix memory leak when freeing struct blk_io_trace
Impact: fix mixed ioctl and ftrace-plugin blktrace use memory leak

When mixing the use of ioctl-based blktrace and ftrace-based blktrace,
we can leak memory in this way:

  # btrace /dev/sda > /dev/null &
  # echo 0 > /sys/block/sda/sda1/trace/enable

now we leak bt->dropped_file, bt->msg_file, bt->rchan...

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 17:27:58 +02:00
Li Zefan
17ba97e347 blktrace: fix blk_probes_ref chaos
Impact: fix mixed ioctl and ftrace-plugin blktrace use refcount bugs

ioctl-based blktrace allocates bt and registers tracepoints when
ioctl(BLKTRACESETUP), and do all cleanups when ioctl(BLKTRACETEARDOWN).

while ftrace-based blktrace allocates/frees bt when:
  # echo 1/0 > /sys/block/sda/sda1/trace/enable

and registers/unregisters tracepoints when:
  # echo blk/nop > /debugfs/tracing/current_tracer
or
  # echo 1/0 > /debugfs/tracing/tracing_enable

The separatation of allocation and registeration causes 2 problems:

  1. current user-space blktrace still calls ioctl(TEARDOWN) when
     ioctl(SETUP) failed:
       # echo 1 > /sys/block/sda/sda1/trace/enable
       # blktrace /dev/sda
         BLKTRACESETUP: Device or resource busy
         ^C
     and now blk_probes_ref == -1

  2. Another way to make blk_probes_ref == -1:
     # plugin sdb && mount sdb1
     # echo 1 > /sys/block/sdb/sdb1/trace/enable
     # remove sdb

This patch does the allocation and registeration when writing
sdaX/trace/enable.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 17:27:45 +02:00
Li Zefan
35ac51bfe4 blktrace: make classic output more classic
Impact: fix ftrace plugin timestamp output

In the classic user-space blktrace, the output timestamp is sec.nsec
not sec.usec.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 17:27:36 +02:00
Li Zefan
eb08f8eb06 blktrace: fix off-by-one bug
'what' is used as the index of array what2act, so it can't >= the array size.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 17:27:18 +02:00
Li Zefan
5554720482 blktrace: fix the original blktrace
Currently the original blktrace, which is using relay and is used via
ioctl, is broken. You can use ftrace to see the output of blktrace,
but user-space blktrace is unusable.

It's broken by "blktrace: add ftrace plugin"
(c71a896154)

 -	if (unlikely(bt->trace_state != Blktrace_running))
 +	if (unlikely(bt->trace_state != Blktrace_running || !blk_tracer_enabled))
		return;

With this patch, both ioctl and ftrace can be used, but of course you
can't use both of them at the same time.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 17:27:08 +02:00
Li Zefan
b5230b56ee blktrace: fix a race when creating blk_tree_root in debugfs
t1                                t2
------                            ------
do_blk_trace_setup()              do_blk_trace_setup()
  if (!blk_tree_root) {
                                    if (!blk_tree_root)
    blk_tree_root = create_dir()
                                      blk_tree_root = create_dir();
                                      (now blk_tree_root == NULL)
  ...
  dir = create_dir(name, blk_tree_root);

Due to this race, t1 will create 'dir' in /debugfs but not /debugfs/block.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 17:26:59 +02:00
Li Zefan
6c051ce030 blktrace: fix timestamp in binary output
I found the timestamp is wrong:

 # echo bin > trace_option
 # echo blk > current_tracer
 # cat trace_pipe | blkparse -i -
 8,0    0        0     0.000000000   504  A   W ...
 ...
 8,7    1        0     0.008534097     0  C   R ...
            (should be 8.534097xxx)

user-space blkparse expects the timestamp to be nanosecond.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 17:26:47 +02:00
Peter Zijlstra
eedeeabdee lockdep: add stack dumps to asserts
Have a better idea about exactly which loc causes a lockdep
limit overflow. Often it's a bug or inefficiency in that
subsystem.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1237376327.5069.253.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 14:53:01 +02:00
Peter Zijlstra
7f1e2ca9f0 hrtimer: fix rq->lock inversion (again)
It appears I inadvertly introduced rq->lock recursion to the
hrtimer_start() path when I delegated running already expired
timers to softirq context.

This patch fixes it by introducing a __hrtimer_start_range_ns()
method that will not use raise_softirq_irqoff() but
__raise_softirq_irqoff() which avoids the wakeup.

It then also changes schedule() to check for pending softirqs and
do the wakeup then, I'm not quite sure I like this last bit, nor
am I convinced its really needed.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org
LKML-Reference: <20090313112301.096138802@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 14:52:52 +02:00
Ingo Molnar
7bee946358 Merge branch 'linus' into locking-for-linus
Conflicts:
	lib/Kconfig.debug
2009-03-31 13:53:43 +02:00
Rusty Russell
558f6ab910 Merge branch 'cpumask-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
Conflicts:

	arch/x86/include/asm/topology.h
	drivers/oprofile/buffer_sync.c
(Both cases: changed in Linus' tree, removed in Ingo's).
2009-03-31 13:33:50 +10:30
Rusty Russell
49502677e1 module: use strstarts()
Impact: minor cleanup.

I'm not going to neaten anyone else's code, but I'm happy to clean up
my own.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-31 13:05:37 +10:30
Rusty Russell
e91defa26c module: don't use stop_machine on module load
Kay Sievers <kay.sievers@vrfy.org> discovered that boot times are slowed
by about half a second because all the stop_machine_create() calls,
and he only probes about 40 modules (I have 125 loaded on this laptop).

We only do stop_machine_create() so we can unlink the module if
something goes wrong, but it's overkill (and buggy anyway: if
stop_machine_create() fails we still call stop_machine_destroy()).

Since we are only protecting against kallsyms (esp. oops) walking the
list, synchronize_sched() is sufficient (synchronize_rcu() is probably
sufficient, but we're not in a hurry).

Kay says of this patch:
	... no module takes more than 40 millisecs to link now, most of
	them are between 3 and 8 millisecs.

	That looks very different to the numbers without this patch
	and the otherwise same setup, where we get heavy noise in the
	traces and many delays of up to 200 millisecs until linking,
	most of them taking 30+ millisecs.

Tested-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-31 13:05:35 +10:30
Arjan van de Ven
acae051565 module: create a request_module_nowait()
There seems to be a common pattern in the kernel where drivers want to
call request_module() from inside a module_init() function. Currently
this would deadlock.

As a result, several drivers go through hoops like scheduling things via
kevent, or creating custom work queues (because kevent can deadlock on them).

This patch changes this to use a request_module_nowait() function macro instead,
which just fires the modprobe off but doesn't wait for it, and thus avoids the
original deadlock entirely.

On my laptop this already results in one less kernel thread running..

(Includes Jiri's patch to use enum umh_wait)

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (bool-ified)
Cc: Jiri Slaby <jirislaby@gmail.com>
2009-03-31 13:05:35 +10:30
Rusty Russell
8c8ef42aee module: include other structures in module version check
With CONFIG_MODVERSIONS, we version 'struct module' using a dummy
export, but other things matter too:

1) 'struct modversion_info' determines the layout of the __versions section,
2) 'struct kernel_param' determines the layout of the __params section,
3) 'struct kernel_symbol' determines __ksymtab*.
4) 'struct marker' determines __markers.
5) 'struct tracepoint' determines __tracepoints.

So we rename 'struct_module' to 'module_layout' and include these in
the signature.  Now it's general we can add others later on without
confusion.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-31 13:05:34 +10:30
Rusty Russell
9cb610d8e3 module: remove the SHF_ALLOC flag on the __versions section.
Impact: reduce kernel memory usage

This patch just takes off the SHF_ALLOC flag on __versions so we don't
keep them around after module load.

This saves about 7% of module memory if CONFIG_MODVERSIONS=y.

Cc: Shawn Bohrer <shawn.bohrer@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-31 13:05:34 +10:30
Rusty Russell
c6e665c8f0 module: clarify the force-loading taint message.
Impact: Message cleanup

Two of three callers of try_to_force_load() are not because of a
missing version, so change the messages:

Old:
	<modname>: no version for "magic" found: kernel tainted.
New:
	<modname>: bad vermagic: kernel tainted.

Old:
	<modname>: no version for "nocrc" found: kernel tainted.
New:
	<modname>: no versions for exported symbols: kernel tainted.

Old:
	<modname>: no version for "<symname>" found: kernel tainted.
New:
	<modname>: <symname>: kernel tainted.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-31 13:05:33 +10:30
Tim Abbott
c6b3780191 module: Export symbols needed for Ksplice
Impact: Expose some module.c symbols

Ksplice uses several functions from module.c in order to resolve
symbols and implement dependency handling.  Calling these functions
requires holding module_mutex, so it is exported.

(This is just the module part of a bigger add-exports patch from Tim).

Cc: Anders Kaseorg <andersk@mit.edu>
Cc: Jeff Arnold <jbarnold@mit.edu>
Signed-off-by: Tim Abbott <tabbott@mit.edu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-31 13:05:33 +10:30
Anders Kaseorg
75a66614db Ksplice: Add functions for walking kallsyms symbols
Impact: New API

kallsyms_lookup_name only returns the first match that it finds.  Ksplice
needs information about all symbols with a given name in order to correctly
resolve local symbols.

kallsyms_on_each_symbol provides a generic mechanism for iterating over the
kallsyms table.

Cc: Jeff Arnold <jbarnold@mit.edu>
Cc: Tim Abbott <tabbott@mit.edu>
Signed-off-by: Anders Kaseorg <andersk@mit.edu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-31 13:05:32 +10:30
Rusty Russell
a6e6abd575 module: remove module_text_address()
Impact: Replace and remove risky (non-EXPORTed) API

module_text_address() returns a pointer to the module, which given locking
improvements in module.c, is useless except to test for NULL:

1) If the module can't go away, use __module_text_address.
2) Otherwise, just use is_module_text_address().

Cc: linux-mtd@lists.infradead.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-31 13:05:32 +10:30
Rusty Russell
e610499e26 module: __module_address
Impact: New API, cleanup

ksplice wants to know the bounds of a module, not just the module text.

It makes sense to have __module_address.  We then implement
is_module_address and __module_text_address in terms of this (and
change is_module_text_address() to bool while we're at it).

Also, add proper kerneldoc for them all.

Cc: Anders Kaseorg <andersk@mit.edu>
Cc: Jeff Arnold <jbarnold@mit.edu>
Cc: Tim Abbott <tabbott@mit.edu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-31 13:05:31 +10:30
Tim Abbott
414fd31b25 module: Make find_symbol return a struct kernel_symbol
Impact: Cleanup, internal API change

Ksplice needs access to the kernel_symbol structure in order to support
modifications to the exported symbol table.

Cc: Anders Kaseorg <andersk@mit.edu>
Cc: Jeff Arnold <jbarnold@mit.edu>
Signed-off-by: Tim Abbott <tabbott@mit.edu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (bugfix and style)
2009-03-31 13:05:31 +10:30
Américo Wang
b10153fe31 kernel/module.c: fix an unused goto label
Impact: cleanup

Label 'free_init' is only used when defined(CONFIG_MODULE_UNLOAD) &&
defined(CONFIG_SMP), so move it inside to shut up gcc.

Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-31 13:05:30 +10:30
Rusty Russell
e180a6b775 param: fix charp parameters set via sysfs
Impact: fix crash on reading from /sys/module/.../ieee80211_default_rc_algo

The module_param type "charp" simply sets a char * pointer in the
module to the parameter in the commandline string: this is why we keep
the (mangled) module command line around.  But when set via sysfs (as
about 11 charp parameters can be) this memory is freed on the way
out of the write().  Future reads hit random mem.

So we kstrdup instead: we have to check we're not in early commandline
parsing, and we have to note when we've used it so we can reliably
kfree the parameter when it's next overwritten, and also on module
unload.

(Thanks to Randy Dunlap for CONFIG_SYSFS=n fixes)

Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Diagnosed-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-31 13:05:30 +10:30
Linus Torvalds
d17abcd541 Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask:
  oprofile: Thou shalt not call __exit functions from __init functions
  cpumask: remove the now-obsoleted pcibus_to_cpumask(): generic
  cpumask: remove cpumask_t from core
  cpumask: convert rcutorture.c
  cpumask: use new cpumask_ functions in core code.
  cpumask: remove references to struct irqaction's mask field.
  cpumask: use mm_cpumask() wrapper: kernel/fork.c
  cpumask: use set_cpu_active in init/main.c
  cpumask: remove node_to_first_cpu
  cpumask: fix seq_bitmap_*() functions.
  cpumask: remove dangerous CPU_MASK_ALL_PTR, &CPU_MASK_ALL
2009-03-30 18:00:26 -07:00
Linus Torvalds
c4e1aa67ed Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (33 commits)
  lockdep: fix deadlock in lockdep_trace_alloc
  lockdep: annotate reclaim context (__GFP_NOFS), fix SLOB
  lockdep: annotate reclaim context (__GFP_NOFS), fix
  lockdep: build fix for !PROVE_LOCKING
  lockstat: warn about disabled lock debugging
  lockdep: use stringify.h
  lockdep: simplify check_prev_add_irq()
  lockdep: get_user_chars() redo
  lockdep: simplify get_user_chars()
  lockdep: add comments to mark_lock_irq()
  lockdep: remove macro usage from mark_held_locks()
  lockdep: fully reduce mark_lock_irq()
  lockdep: merge the !_READ mark_lock_irq() helpers
  lockdep: merge the _READ mark_lock_irq() helpers
  lockdep: simplify mark_lock_irq() helpers #3
  lockdep: further simplify mark_lock_irq() helpers
  lockdep: simplify the mark_lock_irq() helpers
  lockdep: split up mark_lock_irq()
  lockdep: generate usage strings
  lockdep: generate the state bit definitions
  ...
2009-03-30 17:17:35 -07:00
Lai Jiangshan
f69b17d7e7 rcu: rcu_barrier VS cpu_hotplug: Ensure callbacks in dead cpu are migrated to online cpu
cpu hotplug may happen asynchronously, some rcu callbacks are maybe
still on dead cpu, rcu_barrier() also needs to wait for these rcu
callbacks to complete, so we must ensure callbacks in dead cpu are
migrated to online cpu.

Paul E. McKenney's review:

  Good stuff, Lai!!!  Simpler than any of the approaches that I was
  considering, and, better yet, independent of the underlying RCU
  implementation!!!

  I was initially worried that wake_up() might wake only one of two
  possible wait_event()s, namely rcu_barrier() and the CPU_POST_DEAD code,
  but the fact that wait_event() clears WQ_FLAG_EXCLUSIVE avoids that issue.
  I was also worried about the fact that different RCU implementations have
  different mappings of call_rcu(), call_rcu_bh(), and call_rcu_sched(), but
  this is OK as well because we just get an extra (harmless) callback in the
  case that they map together (for example, Classic RCU has call_rcu_sched()
  mapping to call_rcu()).

  Overlap of CPU-hotplug operations is prevented by cpu_add_remove_lock,
  and any stray callbacks that arrive (for example, from irq handlers
  running on the dying CPU) either are ahead of the CPU_DYING callbacks on
  the one hand (and thus accounted for), or happened after the rcu_barrier()
  started on the other (and thus don't need to be accounted for).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <49C36476.1010400@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-31 00:09:37 +02:00
Ingo Molnar
65fb0d23fc Merge branch 'linus' into cpumask-for-linus
Conflicts:
	arch/x86/kernel/cpu/common.c
2009-03-30 23:53:32 +02:00
Peter Zijlstra
2f85018152 lockdep: fix deadlock in lockdep_trace_alloc
Heiko reported that we grab the graph lock with irqs enabled.

Fix this by providng the same wrapper as all other lockdep entry
functions have.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@suse.de>
LKML-Reference: <1237544000.24626.52.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-30 23:19:24 +02:00
Rafael J. Wysocki
749b0afc3a kexec: Change kexec jump code ordering
Change the ordering of the kexec jump code so that the nonboot CPUs
are disabled after calling device drivers' "late suspend" methods.

This change reflects the recent modifications of the power management
code that is also used by kexec jump.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-03-30 21:46:55 +02:00
Rafael J. Wysocki
4aecd67189 PM: Change hibernation code ordering
Change the ordering of the hibernation core code so that the platform
"prepare" callbacks are executed and the nonboot CPUs are disabled
after calling device drivers' "late suspend" methods.

This change (along with the previous analogous change of the suspend
core code) will allow us to rework the PCI PM core so that the power
state of devices is changed in the "late" phase of suspend (and
analogously in the "early" phase of resume), which in turn will allow
us to avoid the race condition where a device using shared interrupts
is put into a low power state with interrupts enabled and then an
interrupt (for another device) comes in and confuses its driver.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-03-30 21:46:54 +02:00
Rafael J. Wysocki
900af0d973 PM: Change suspend code ordering
Change the ordering of the suspend core code so that the platform
"prepare" callback is executed and the nonboot CPUs are disabled
after calling device drivers' "late suspend" methods.

This change will allow us to rework the PCI PM core so that the power
state of devices is changed in the "late" phase of suspend (and
analogously in the "early" phase of resume), which in turn will allow
us to avoid the race condition where a device using shared interrupts
is put into a low power state with interrupts enabled and then an
interrupt (for another device) comes in and confuses its driver.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-03-30 21:46:54 +02:00
Rafael J. Wysocki
2ed8d2b3a8 PM: Rework handling of interrupts during suspend-resume
Use the functions introduced in by the previous patch,
suspend_device_irqs(), resume_device_irqs() and check_wakeup_irqs(),
to rework the handling of interrupts during suspend (hibernation) and
resume.  Namely, interrupts will only be disabled on the CPU right
before suspending sysdevs, while device drivers will be prevented
from receiving interrupts, with the help of the new helper function,
before their "late" suspend callbacks run (and analogously during
resume).

In addition, since the device interrups are now disabled before the
CPU has turned all interrupts off and the CPU will ACK the interrupts
setting the IRQ_PENDING bit for them, check in sysdev_suspend() if
any wake-up interrupts are pending and abort suspend if that's the
case.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-03-30 21:46:54 +02:00
Rafael J. Wysocki
0a0c5168df PM: Introduce functions for suspending and resuming device interrupts
Introduce helper functions allowing us to prevent device drivers from
getting any interrupts (without disabling interrupts on the CPU)
during suspend (or hibernation) and to make them start to receive
interrupts again during the subsequent resume.  These functions make it
possible to keep timer interrupts enabled while the "late" suspend and
"early" resume callbacks provided by device drivers are being
executed.  In turn, this allows device drivers' "late" suspend and
"early" resume callbacks to sleep, execute ACPI callbacks etc.

The functions introduced here will be used to rework the handling of
interrupts during suspend (hibernation) and resume.  Namely,
interrupts will only be disabled on the CPU right before suspending
sysdevs, while device drivers will be prevented from receiving
interrupts, with the help of the new helper function, before their
"late" suspend callbacks run (and analogously during resume).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-03-30 21:46:54 +02:00
Matt LaPlante
692105b8ac trivial: fix typos/grammar errors in Kconfig texts
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-03-30 15:22:01 +02:00
Nick Andrew
877d03105d trivial: Fix misspelling of firmware
Fix misspelling of firmware.

Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-03-30 15:21:59 +02:00
Uwe Kleine-Koenig
2a93a1f214 trivial: fix typo "resgister" -> "register"
Signed-off-by: Uwe Kleine-Koenig <ukleinek@strlen.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-03-30 15:21:58 +02:00
Rusty Russell
612a726faf cpumask: remove cpumask_t from core
Impact: cleanup

struct cpumask is nicer, and we use it to make where we've made code
safe for CONFIG_CPUMASK_OFFSTACK=y.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-30 22:05:17 +10:30
Rusty Russell
73d0a4b107 cpumask: convert rcutorture.c
We're getting rid of cpumasks on the stack.

Simply change tmp_mask to a global, and allocate it in
rcu_torture_init().

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@freedesktop.org>
2009-03-30 22:05:16 +10:30
Rusty Russell
aa85ea5b89 cpumask: use new cpumask_ functions in core code.
Impact: cleanup

Time to clean up remaining laggards using the old cpu_ functions.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Trond.Myklebust@netapp.com
2009-03-30 22:05:16 +10:30
Rusty Russell
9489424454 cpumask: use mm_cpumask() wrapper: kernel/fork.c
Impact: futureproof

Makes code futureproof against the impending change to mm->cpu_vm_mask.

It's also a chance to use the new cpumask_ ops which take a pointer.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-30 22:05:13 +10:30
Rusty Russell
2b17fa506c cpumask: use set_cpu_active in init/main.c
cpu_active_map is deprecated in favor of cpu_active_mask, which is
const for safety: we use accessors now (set_cpu_active) is we really
want to make a change.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-30 22:05:12 +10:30
Rusty Russell
1a2142afa5 cpumask: remove dangerous CPU_MASK_ALL_PTR, &CPU_MASK_ALL
Impact: cleanup

(Thanks to Al Viro for reminding me of this, via Ingo)

CPU_MASK_ALL is the (deprecated) "all bits set" cpumask, defined as so:

	#define CPU_MASK_ALL (cpumask_t) { { ... } }

Taking the address of such a temporary is questionable at best,
unfortunately 321a8e9d (cpumask: add CPU_MASK_ALL_PTR macro) added
CPU_MASK_ALL_PTR:

	#define CPU_MASK_ALL_PTR (&CPU_MASK_ALL)

Which formalizes this practice.  One day gcc could bite us over this
usage (though we seem to have gotten away with it so far).

So replace everywhere which used &CPU_MASK_ALL or CPU_MASK_ALL_PTR
with the modern "cpu_all_mask" (a real const struct cpumask *).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mike Travis <travis@sgi.com>
2009-03-30 22:05:11 +10:30
Jeremy Fitzhardinge
224101ed69 x86/paravirt: finish change from lazy cpu to context switch start/end
Impact: fix lazy context switch API

Pass the previous and next tasks into the context switch start
end calls, so that the called functions can properly access the
task state (esp in end_context_switch, in which the next task
is not yet completely current).

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2009-03-29 23:36:01 -07:00
Jeremy Fitzhardinge
7fd7d83d49 x86/pvops: replace arch_enter_lazy_cpu_mode with arch_start_context_switch
Impact: simplification, prepare for later changes

Make lazy cpu mode more specific to context switching, so that
it makes sense to do more context-switch specific things in
the callbacks.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2009-03-29 23:35:59 -07:00
Benjamin Herrenschmidt
9ff9a26b78 Merge commit 'origin/master' into next
Manual merge of:
	arch/powerpc/include/asm/elf.h
	drivers/i2c/busses/i2c-mpc.c
2009-03-30 14:04:53 +11:00
Randy Dunlap
d5ac537e5f sched: fix errors in struct & function comments
Fix kernel-doc errors in sched.c:  the structs don't have
kernel-doc notation and the short function description needs to
be one line only.

  Error(kernel/sched.c:3197): cannot understand prototype: 'struct sd_lb_stats '
  Error(kernel/sched.c:3228): cannot understand prototype: 'struct sg_lb_stats '
  Error(kernel/sched.c:3375): duplicate section name 'Description'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-29 08:12:39 -07:00
Linus Torvalds
c31f403de6 Merge branch 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futex: remove the pointer math from double_unlock_hb, fix
  futex: remove the pointer math from double_unlock_hb
  futex: clean up fault logic
  futex: unlock before returning -EFAULT
  futex: use current->time_slack_ns for rt tasks too
  futex: add double_unlock_hb()
  futex: additional (get|put)_futex_key() fixes
  futex: update futex commentary
2009-03-28 17:32:14 -07:00
Ingo Molnar
38a6ed3ed8 Merge branch 'linus' into core/printk 2009-03-28 23:34:14 +01:00
Ingo Molnar
d00ab2fdd4 Merge branch 'linus' into core/futexes 2009-03-28 23:24:12 +01:00
Linus Torvalds
eedf2c5296 Merge git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-for-30
* git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-for-30:
  fastboot: remove duplicate unpack_to_rootfs()
  ide/net: flip the order of SATA and network init
  async: remove the temporary (2.6.29) "async is off by default" code

Fix up conflicts in init/initramfs.c manually
2009-03-28 14:00:33 -07:00
Arjan van de Ven
9710794383 async: remove the temporary (2.6.29) "async is off by default" code
Now that everyone has been able to test the async code (and it's being used
in the Moblin betas by default), we can enable it by default.
The various fixes needed have gone into 2.6.29 already.

[With an important bugfix from Stefan Richter]

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-03-28 13:05:30 -07:00
Ingo Molnar
82268da1b1 Merge branch 'linus' into percpu-cpumask-x86-for-linus-2
Conflicts:
	arch/sparc/kernel/time_64.c
	drivers/gpu/drm/drm_proc.c

Manual merge to resolve build warning due to phys_addr_t type change
on x86:

	drivers/gpu/drm/drm_info.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-28 04:26:01 +01:00
Linus Torvalds
3ae5080f4c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (37 commits)
  fs: avoid I_NEW inodes
  Merge code for single and multiple-instance mounts
  Remove get_init_pts_sb()
  Move common mknod_ptmx() calls into caller
  Parse mount options just once and copy them to super block
  Unroll essentials of do_remount_sb() into devpts
  vfs: simple_set_mnt() should return void
  fs: move bdev code out of buffer.c
  constify dentry_operations: rest
  constify dentry_operations: configfs
  constify dentry_operations: sysfs
  constify dentry_operations: JFS
  constify dentry_operations: OCFS2
  constify dentry_operations: GFS2
  constify dentry_operations: FAT
  constify dentry_operations: FUSE
  constify dentry_operations: procfs
  constify dentry_operations: ecryptfs
  constify dentry_operations: CIFS
  constify dentry_operations: AFS
  ...
2009-03-27 16:23:12 -07:00
Sukadev Bhattiprolu
a3ec947c85 vfs: simple_set_mnt() should return void
simple_set_mnt() is defined as returning 'int' but always returns 0.
Callers assume simple_set_mnt() never fails and don't properly cleanup if
it were to _ever_ fail.  For instance, get_sb_single() and get_sb_nodev()
should:

        up_write(sb->s_unmount);
        deactivate_super(sb);

if simple_set_mnt() fails.

Since simple_set_mnt() never fails, would be cleaner if it did not
return anything.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-27 14:44:03 -04:00
Al Viro
3ba13d179e constify dentry_operations: rest
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-27 14:44:03 -04:00
Ingo Molnar
6e15cf0486 Merge branch 'core/percpu' into percpu-cpumask-x86-for-linus-2
Conflicts:
	arch/parisc/kernel/irq.c
	arch/x86/include/asm/fixmap_64.h
	arch/x86/include/asm/setup.h
	kernel/irq/handle.c

Semantic merge:
        arch/x86/include/asm/fixmap.h

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-27 17:28:43 +01:00
Linus Torvalds
a8416961d3 Merge branch 'irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (32 commits)
  x86: disable __do_IRQ support
  sparseirq, powerpc/cell: fix unused variable warning in interrupt.c
  genirq: deprecate obsolete typedefs and defines
  genirq: deprecate __do_IRQ
  genirq: add doc to struct irqaction
  genirq: use kzalloc instead of explicit zero initialization
  genirq: make irqreturn_t an enum
  genirq: remove redundant if condition
  genirq: remove unused hw_irq_controller typedef
  irq: export remove_irq() and setup_irq() symbols
  irq: match remove_irq() args with setup_irq()
  irq: add remove_irq() for freeing of setup_irq() irqs
  genirq: assert that irq handlers are indeed running in hardirq context
  irq: name 'p' variables a bit better
  irq: further clean up the free_irq() code flow
  irq: refactor and clean up the free_irq() code flow
  irq: clean up manage.c
  irq: use GFP_KERNEL for action allocation in request_irq()
  kernel/irq: fix sparse warning: make symbol static
  irq: optimize init_kstat_irqs/init_copy_kstat_irqs
  ...
2009-03-26 16:06:50 -07:00
Linus Torvalds
6671de344c Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits)
  posix timers: fix RLIMIT_CPU && fork()
  time: ntp: fix bug in ntp_update_offset() & do_adjtimex(), fix
  time: ntp: clean up second_overflow()
  time: ntp: simplify ntp_tick_adj calculations
  time: ntp: make 64-bit constants more robust
  time: ntp: refactor do_adjtimex() some more
  time: ntp: refactor do_adjtimex()
  time: ntp: fix bug in ntp_update_offset() & do_adjtimex()
  time: ntp: micro-optimize ntp_update_offset()
  time: ntp: simplify ntp_update_offset_fll()
  time: ntp: refactor and clean up ntp_update_offset()
  time: ntp: refactor up ntp_update_frequency()
  time: ntp: clean up ntp_update_frequency()
  time: ntp: simplify the MAX_TICKADJ_SCALED definition
  time: ntp: simplify the second_overflow() code flow
  time: ntp: clean up kernel/time/ntp.c
  x86: hpet: stop HPET_COUNTER when programming periodic mode
  x86: hpet: provide separate functions to stop and start the counter
  x86: hpet: print HPET registers during setup (if hpet=verbose is used)
  time: apply NTP frequency/tick changes immediately
  ...
2009-03-26 16:05:42 -07:00
Linus Torvalds
831576fe40 Merge branch 'sched-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (46 commits)
  sched: Add comments to find_busiest_group() function
  sched: Refactor the power savings balance code
  sched: Optimize the !power_savings_balance during fbg()
  sched: Create a helper function to calculate imbalance
  sched: Create helper to calculate small_imbalance in fbg()
  sched: Create a helper function to calculate sched_domain stats for fbg()
  sched: Define structure to store the sched_domain statistics for fbg()
  sched: Create a helper function to calculate sched_group stats for fbg()
  sched: Define structure to store the sched_group statistics for fbg()
  sched: Fix indentations in find_busiest_group() using gotos
  sched: Simple helper functions for find_busiest_group()
  sched: remove unused fields from struct rq
  sched: jiffies not printed per CPU
  sched: small optimisation of can_migrate_task()
  sched: fix typos in documentation
  sched: add avg_overlap decay
  x86, sched_clock(): mark variables read-mostly
  sched: optimize ttwu vs group scheduling
  sched: TIF_NEED_RESCHED -> need_reshed() cleanup
  sched: don't rebalance if attached on NULL domain
  ...
2009-03-26 16:05:01 -07:00
David S. Miller
08abe18af1 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	drivers/net/wimax/i2400m/usb-notif.c
2009-03-26 15:23:24 -07:00
Linus Torvalds
0c93ea4064 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (61 commits)
  Dynamic debug: fix pr_fmt() build error
  Dynamic debug: allow simple quoting of words
  dynamic debug: update docs
  dynamic debug: combine dprintk and dynamic printk
  sysfs: fix some bin_vm_ops errors
  kobject: don't block for each kobject_uevent
  sysfs: only allow one scheduled removal callback per kobj
  Driver core: Fix device_move() vs. dpm list ordering, v2
  Driver core: some cleanup on drivers/base/sys.c
  Driver core: implement uevent suppress in kobject
  vcs: hook sysfs devices into object lifetime instead of "binding"
  driver core: fix passing platform_data
  driver core: move platform_data into platform_device
  sysfs: don't block indefinitely for unmapped files.
  driver core: move knode_bus into private structure
  driver core: move knode_driver into private structure
  driver core: move klist_children into private structure
  driver core: create a private portion of struct device
  driver core: remove polling for driver_probe_done(v5)
  sysfs: reference sysfs_dirent from sysfs inodes
  ...

Fixed conflicts in drivers/sh/maple/maple.c manually
2009-03-26 11:17:04 -07:00
Ingo Molnar
7c526e1fef Merge branches 'timers/new-apis', 'timers/ntp' and 'timers/urgent' into timers/core 2009-03-26 15:45:52 +01:00
Ingo Molnar
a5ebc0b1a7 Merge commit 'v2.6.29' into timers/core 2009-03-26 15:45:22 +01:00
Tom Zanussi
9a8118baae tracing: filter fix for TRACE_EVENT_FORMAT events
Impact: fix crash (hang) when using TRACE_EVENT_FORMAT filter files

filters are only hooked up to the tracepoint events defined using
TRACE_EVENT but not the tracers that use TRACE_EVENT_FORMAT, such
as ftrace.

Do not display the filter files at all for TRACE_EVENT_FORMAT events
for the time being.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237878882.8339.61.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-26 09:13:14 +01:00
Steven Rostedt
34886c8bc5 tracing: add average time in function to function profiler
Show the average time in the function (Time / Hit)

  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  mwait_idle                              51    140326.6 us     2751.503 us
  smp_apic_timer_interrupt                47    3517.735 us     74.845 us
  schedule                                10    2738.754 us     273.875 us
  __schedule                              10    2732.857 us     273.285 us
  hrtimer_interrupt                       47    1896.104 us     40.342 us
  irq_exit                                56    1711.833 us     30.568 us
  __run_hrtimer                           47    1315.589 us     27.991 us
  tick_sched_timer                        47    1138.690 us     24.227 us
  do_softirq                              56    1116.829 us     19.943 us
  __do_softirq                            56    1066.932 us     19.052 us
  do_IRQ                                   9    926.153 us      102.905 us

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-25 21:03:36 -04:00
Steven Rostedt
318e0a73c9 tracing: remove on the fly allocator from function profiler
Impact: safer code

The on the fly allocator for the function profiler was to save
memory. But at the expense of stability. Although it survived several
tests, allocating from the function tracer is just too risky, just
to save space.

This patch removes the allocator and simply allocates enough entries
at start up.

Each function gets a profiling structure of 40 bytes. With an average
of 20K functions, and this is for each CPU, we have 800K per online
CPU. This is not too bad, at least for non-embedded.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-25 21:03:17 -04:00
Zhaolei
2a4efa4245 ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release()
"Because when we call ftrace_free_rec we change the rec->ip to point to the
  next record in the chain. Something is very wrong if rec->ip >= s &&
  rec->ip < e and the record is already free."

 "Note, use FTRACE_WARN_ON() macro. This way it shuts down ftrace if it is
  hit and helps to avoid further damage later."
                   -- Steven Rostedt <rostedt@goodmis.org>

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-25 17:45:36 -04:00
Lai Jiangshan
2f63b840bc trace_workqueues: fix empty line's output
Empty lines separate cpus stat. After previous
fix(trace_stat: keep original order) applied, the empty lines
are displayed at incorrect position.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <49C9F266.2060706@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 18:32:35 +01:00
Lai Jiangshan
220ba351df trace_stat: keep original order
Impact: make trace_stat files show items with the original order

trace_stat tracer reverse the items, it makes the output
looks a little ugly.

Example, when we read trace_stat/workqueues, we get cpu#7's stat.
at first, and then cpu#6... cpu#0.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <49C9F23F.5040307@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 18:32:34 +01:00
Lai Jiangshan
e6f489013b trace_stat: don't call seq_printf() in seq_operation->start()
Impact: Fix incorrect way using seq_file's API

Use SEQ_START_TOKEN instead of calling ->stat_headers()
int seq_operation->start().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
LKML-Reference: <49C9EAE5.5070202@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 18:32:34 +01:00
Steven Rostedt
fb9fb015e9 tracing: clean up tracing profiler
Ingo Molnar suggested clean ups for the profiling code. This patch
makes those updates.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-25 13:26:41 -04:00
Gautham R Shenoy
b7bb4c9bb0 sched: Add comments to find_busiest_group() function
Impact: cleanup

Add /** style comments around find_busiest_group(). Also add a few
explanatory comments.

This concludes the find_busiest_group() cleanup. The function is
now down to 72 lines from the original 313 lines.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com>
LKML-Reference: <20090325091427.13992.18933.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 13:28:30 +01:00
Gautham R Shenoy
c071df1852 sched: Refactor the power savings balance code
Impact: cleanup

Create seperate helper functions to initialize the
power-savings-balance related variables, to update them and
to check if we have a scope for performing power-savings balance.

Add no-op inline functions for the !(CONFIG_SCHED_MC || CONFIG_SCHED_SMT)
case.

This will eliminate all the #ifdef jungle in find_busiest_group() and the
other helper functions.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com>
LKML-Reference: <20090325091422.13992.73616.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:48 +01:00
Gautham R Shenoy
a021dc0337 sched: Optimize the !power_savings_balance during fbg()
Impact: cleanup, micro-optimization

We don't need to perform power_savings balance if either the
cpu is NOT_IDLE or if the sched_domain doesn't contain the
SD_POWERSAVINGS_BALANCE flag set.

Currently, we check for these conditions multiple number of
times, even though these variables don't change over the scope
of find_busiest_group().

Check once, and store the value in the already exiting
"power_savings_balance" variable.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com>
LKML-Reference: <20090325091417.13992.2657.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:48 +01:00
Gautham R Shenoy
dbc523a3b8 sched: Create a helper function to calculate imbalance
Move all the imbalance calculation out of find_busiest_group()
through this helper function.

With this change, the structure of find_busiest_group() will be
as follows:

- update_sched_domain_statistics.

- check if imbalance exits.

- update imbalance and return busiest.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com>
LKML-Reference: <20090325091411.13992.43293.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:47 +01:00
Gautham R Shenoy
2e6f44aeda sched: Create helper to calculate small_imbalance in fbg()
Impact: cleanup

We have two places in find_busiest_group() where we need to calculate
the minor imbalance before returning the busiest group. Encapsulate
this functionality into a seperate helper function.

Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
LKML-Reference: <20090325091406.13992.54316.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:47 +01:00
Gautham R Shenoy
37abe198b1 sched: Create a helper function to calculate sched_domain stats for fbg()
Impact: cleanup

Create a helper function named update_sd_lb_stats() to update the
various sched_domain related statistics in find_busiest_group().

With this we would have moved all the statistics computation out of
find_busiest_group().

Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
LKML-Reference: <20090325091401.13992.88737.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:46 +01:00
Gautham R Shenoy
222d656dea sched: Define structure to store the sched_domain statistics for fbg()
Impact: cleanup

Currently we use a lot of local variables in find_busiest_group()
to capture the various statistics related to the sched_domain.
Group them together into a single data structure.

This will help us to offload the job of updating the sched_domain
statistics to a helper function.

Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
LKML-Reference: <20090325091356.13992.25970.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:46 +01:00
Gautham R Shenoy
1f8c553d0f sched: Create a helper function to calculate sched_group stats for fbg()
Impact: cleanup

Create a helper function named update_sg_lb_stats() which
can be invoked to calculate the individual group's statistics
in find_busiest_group().

This reduces the lenght of find_busiest_group() considerably.

Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Aked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
LKML-Reference: <20090325091351.13992.43461.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:45 +01:00
Gautham R Shenoy
381be78fdc sched: Define structure to store the sched_group statistics for fbg()
Impact: cleanup

Currently a whole bunch of variables are used to store the
various statistics pertaining to the groups we iterate over
in find_busiest_group().

Group them together in a single data structure and add
appropriate comments.

This will be useful later on when we create helper functions
to calculate the sched_group statistics.

Credit: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
LKML-Reference: <20090325091345.13992.20099.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:45 +01:00
Gautham R Shenoy
6dfdb06290 sched: Fix indentations in find_busiest_group() using gotos
Impact: cleanup

Some indentations in find_busiest_group() can minimized by using
early exits with the help of gotos. This improves readability in
a couple of cases.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com>
LKML-Reference: <20090325091340.13992.45062.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:44 +01:00
Gautham R Shenoy
67bb6c036d sched: Simple helper functions for find_busiest_group()
Impact: cleanup

Currently the load idx calculation code is in find_busiest_group().
Move that to a static inline helper function.

Similary, to find the first cpu of a sched_group we use
cpumask_first(sched_group_cpus(group))

Use a helper to that. It improves readability in some cases.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Balbir Singh" <balbir@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Dhaval Giani" <dhaval@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: "Vaidyanathan Srinivasan" <svaidy@linux.vnet.ibm.com>
LKML-Reference: <20090325091335.13992.55424.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-25 10:30:44 +01:00
Ingo Molnar
b6d9842258 Merge branch 'sched/cleanups'; commit 'v2.6.29' into sched/core 2009-03-25 10:26:51 +01:00
Steven Rostedt
a2a16d6a31 function-graph: add option to calculate graph time or not
graph time is the time that a function is executing another function.
Thus if function A calls B, if graph-time is set, then the time for
A includes B. This is the default behavior. But if graph-time is off,
then the time spent executing B is subtracted from A.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-24 23:41:11 -04:00
Steven Rostedt
cafb168a1c tracing: make the function profiler per cpu
Impact: speed enhancement

By making the function profiler record in per cpu data we not only
get better readings, avoid races, we also do not have to take any
locks.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-24 23:41:10 -04:00
Steven Rostedt
0706f1c48c tracing: adding function timings to function profiler
If the function graph trace is enabled, the function profiler will
use it to take the timing of the functions.

 cat /debug/tracing/trace_stat/functions

  Function                               Hit    Time
  --------                               ---    ----
  mwait_idle                             127    183028.4 us
  schedule                                26    151997.7 us
  __schedule                              31    151975.1 us
  sys_wait4                                2    74080.53 us
  do_wait                                  2    74077.80 us
  sys_newlstat                           138    39929.16 us
  do_path_lookup                         179    39845.79 us
  vfs_lstat_fd                           138    39761.97 us
  user_path_at                           153    39469.58 us
  path_walk                              179    39435.76 us
  __link_path_walk                       189    39143.73 us
[...]

Note the times are skewed due to the function graph tracer not taking
into account schedules.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-24 23:41:09 -04:00
Steven Rostedt
493762fc53 tracing: move function profiler data out of function struct
Impact: reduce size of memory in function profiler

The function profiler originally introduces its counters into the
function records itself. There is 20 thousand different functions on
a normal system, and that is adding 20 thousand counters for profiling
event when not needed.

A normal run of the profiler yields only a couple of thousand functions
executed, depending on what is being profiled. This means we have around
18 thousand useless counters.

This patch rectifies this by moving the data out of the function
records used by dynamic ftrace. Data is preallocated to hold the functions
when the profiling begins. Checks are made during profiling to see if
more recorcds should be allocated, and they are allocated if it is safe
to do so.

This also removes the dependency from using dynamic ftrace, and also
removes the overhead by having it enabled.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-24 23:41:06 -04:00
Steven Rostedt
bac429f037 tracing: add function profiler
Impact: new profiling feature

This patch adds a function profiler. In debugfs/tracing/ two new
files are created.

  function_profile_enabled  - to enable or disable profiling

  trace_stat/functions   - the profiled functions.

For example:

  echo 1 > /debugfs/tracing/function_profile_enabled
  ./hackbench 50
  echo 0 > /debugfs/tracing/function_profile_enabled

yields:

  cat /debugfs/tracing/trace_stat/functions

  Function                               Hit
  --------                               ---
  _spin_lock                        10106442
  _spin_unlock                      10097492
  kfree                              6013704
  _spin_unlock_irqrestore            4423941
  _spin_lock_irqsave                 4406825
  __phys_addr                        4181686
  __slab_free                        4038222
  dput                               4030130
  path_put                           4023387
  unroll_tree_refs                   4019532
[...]

The most hit functions are listed first. Functions that are not
hit are not listed.

This feature depends on and uses dynamic function tracing. When the
function profiling is disabled, no overhead occurs. But it still
takes up around 300KB to hold the data, thus it is not recomended
to keep it enabled for systems low on memory.

When a '1' is echoed into the function_profile_enabled file, the
counters for is function is reset back to zero. Thus you can see what
functions are hit most by different programs.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-24 23:40:00 -04:00
Steven Rostedt
425480081e tracing: add handler to trace_stat
Currently, if a trace_stat user wants a handle to some private data,
the trace_stat infrastructure does not supply a way to do that.

This patch passes the trace_stat structure to the start function of
the trace_stat code.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-24 23:22:58 -04:00
Jason Baron
e9d376f0fa dynamic debug: combine dprintk and dynamic printk
This patch combines Greg Bank's dprintk() work with the existing dynamic
printk patchset, we are now calling it 'dynamic debug'.

The new feature of this patchset is a richer /debugfs control file interface,
(an example output from my system is at the bottom), which allows fined grained
control over the the debug output. The output can be controlled by function,
file, module, format string, and line number.

for example, enabled all debug messages in module 'nf_conntrack':

echo -n 'module nf_conntrack +p' > /mnt/debugfs/dynamic_debug/control

to disable them:

echo -n 'module nf_conntrack -p' > /mnt/debugfs/dynamic_debug/control

A further explanation can be found in the documentation patch.

Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-24 16:38:26 -07:00
Luis Henriques
67aa0f767a sched: remove unused fields from struct rq
Impact: cleanup, new schedstat ABI

Since they are used on in statistics and are always set to zero, the
following fields from struct rq have been removed: yld_exp_empty,
yld_act_empty and yld_both_empty.

Both Sched Debug and SCHEDSTAT_VERSION versions has also been
incremented since ABIs have been changed.

The schedtop tool has been updated to properly handle new version of
schedstat:

   http://rt.wiki.kernel.org/index.php/Schedtop_utility

Signed-off-by: Luis Henriques <henrix@sapo.pt>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20090324221002.GA10061@hades.domain.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 23:16:51 +01:00
Lai Jiangshan
ee000b7f9f tracing: use union for multi-usages field
Impact: cleanup

struct dyn_ftrace::ip has different usages in his lifecycle,
we use union for it. And also for struct dyn_ftrace::flags.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <49C871BE.3080405@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 16:43:12 +01:00
Lai Jiangshan
cc59c9e8d0 ftrace: show virtual PID
Impact: fix PID output under namespaces

When current namespace is not the global namespace,
pid read from set_ftrace_pid is no correct.

 # ~/newpid_namespace_run bash
 # echo $$
 1
 # echo 1 > set_ftrace_pid
 # cat set_ftrace_pid
 3756

Since we write virtual PID to set_ftrace_pid, we need get
virtual PID when we read it.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <49C84D65.9050606@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 16:42:49 +01:00
Steven Rostedt
be6f164a02 function-graph: add option for include sleep times
Impact: give user a choice to show times spent while sleeping

The user may want to see the time a function spent sleeping.
This patch adds the trace option "sleep-time" to allow that.
The "sleep-time" option is default on.

 echo sleep-time > /debug/tracing/trace_options

produces:

 ------------------------------------------
 2)  avahi-d-3428  =>    <idle>-0
 ------------------------------------------

 2)               |      finish_task_switch() {
 2)   0.621 us    |        _spin_unlock_irq();
 2)   2.202 us    |      }
 2) ! 1002.197 us |    }
 2) ! 1003.521 us |  }

where as,

 echo nosleep-time > /debug/tracing/trace_options

produces:

 0)    <idle>-0    =>  yum-upd-3416
 ------------------------------------------

 0)               |              finish_task_switch() {
 0)   0.643 us    |                _spin_unlock_irq();
 0)   2.342 us    |              }
 0) + 41.302 us   |            }
 0) + 42.453 us   |          }

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-24 11:06:24 -04:00
Steven Rostedt
8aef2d2856 function-graph: ignore times across schedule
Impact: more accurate timings

The current method of function graph tracing does not take into
account the time spent when a task is not running. This shows functions
that call schedule have increased costs:

 3) + 18.664 us   |      }
 ------------------------------------------
 3)    <idle>-0    =>  kblockd-123
 ------------------------------------------

 3)               |      finish_task_switch() {
 3)   1.441 us    |        _spin_unlock_irq();
 3)   3.966 us    |      }
 3) ! 2959.433 us |    }
 3) ! 2961.465 us |  }

This patch uses the tracepoint in the scheduling context switch to
account for time that has elapsed while a task is scheduled out.
Now we see:

 ------------------------------------------
 3)    <idle>-0    =>  edac-po-1067
 ------------------------------------------

 3)               |      finish_task_switch() {
 3)   0.685 us    |        _spin_unlock_irq();
 3)   2.331 us    |      }
 3) + 41.439 us   |    }
 3) + 42.663 us   |  }

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-24 09:33:30 -04:00
Steven Rostedt
05ce5818ad function-graph: prevent more than one tracer registering
Impact: prevent crash due to multiple function graph tracers

The function graph tracer can currently only handle a single tracer
being registered. If another tracer registers with the function
graph tracer it can crash the system.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-24 09:32:52 -04:00
Steven Rostedt
5d1a03dc54 function-graph: moved the timestamp from arch to generic code
This patch move the timestamp from happening in the arch specific
code into the general code. This allows for better control by the tracer
to time manipulation.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-24 09:31:34 -04:00
Steven Rostedt
098335215a tracing: fix memory leak in trace_stat
If the function profiler does not have any items recorded and one were
to cat the function stat file, the kernel would take a BUG with a NULL
pointer dereference.

Looking further into this, I found that returning NULL from stat_start
did not stop the stat logic, and would later call stat_next. This breaks
from the way seq_file works, so I looked into fixing the stat code.

This is where I noticed that the last next_entry is never freed.
It is allocated, and if the stat_next returns NULL, the code breaks out
of the loop, unlocks the mutex and exits. We never link the next_entry
nor do we free it. Thus it is a real memory leak.

This patch rearranges the code a bit to not only fix the memory leak,
but also to act more like seq_file where nothing is printed if there
is nothing to print. That is, stat_start returns NULL.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-24 09:07:35 -04:00
Li Zefan
093419971e blktrace: print human-readable act_mask
Impact: new feature, allow symbolic values in /debug/tracing/act_mask

Print stringified act_mask instead of hex value:

 # cat act_mask
 read,write,barrier,sync,queue,requeue,issue,complete,fs,pc,ahead,meta,
 discard,drv_data
 # echo "meta,write" > act_mask
 # cat act_mask
 write,meta

Also:
 - make act_mask accept "ahead", "meta", "discard" and "drv_data"
 - use strsep() instead of strchr() to parse user input
 - return -EINVAL if a token is not found in the mask map
 - fix a bug that 'value' is unsigned, so it can < 0
 - propagate error value of blk_trace_mask2str() to userspace, but not
   always return -ENXIO.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <49C8AB42.1000802@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 13:09:00 +01:00
Li Zefan
e0dc81bec0 blktrace: fix t_error()
Impact: fix error flag output

t_error() should return t->error but not t->sector.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <49C8945F.5020802@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 13:09:00 +01:00
Li Zefan
65796348e0 blktrace: fix wrong calculation of RWBS
Impact: fix the output of IO type category characters

Trace categories are the upper 16 bits, not the lower 16 bits.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <49C89432.8010805@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 13:08:59 +01:00
Li Zefan
e4955c9986 blktrace: mark ddir_act[] const
Impact: cleanup

ddir_act and what2act always stay immutable.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <49C89415.5080503@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 13:08:59 +01:00
Thomas Gleixner
f48fe81e5b genirq: threaded irq handlers review fixups
Delta patch to address the review comments.

      - Implement warning when IRQ_WAKE_THREAD is requested and no
        thread handler installed
      - coding style fixes

Pointed-out-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-03-24 12:15:23 +01:00
Arjan van de Ven
935bd5b971 genirq: add support for threaded interrupts to devres
Some devices use devres_request_irq() for to install their interrupt
handler. Add support for threaded interrupts to devres as well.

[tglx - simplified and adapted to latest threadirq version]

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-03-24 12:15:23 +01:00
Thomas Gleixner
3aa551c9b4 genirq: add threaded interrupt handler support
Add support for threaded interrupt handlers:

A device driver can request that its main interrupt handler runs in a
thread. To achive this the device driver requests the interrupt with
request_threaded_irq() and provides additionally to the handler a
thread function. The handler function is called in hard interrupt
context and needs to check whether the interrupt originated from the
device. If the interrupt originated from the device then the handler
can either return IRQ_HANDLED or IRQ_WAKE_THREAD. IRQ_HANDLED is
returned when no further action is required. IRQ_WAKE_THREAD causes
the genirq code to invoke the threaded (main) handler. When
IRQ_WAKE_THREAD is returned handler must have disabled the interrupt
on the device level. This is mandatory for shared interrupt handlers,
but we need to do it as well for obscure x86 hardware where disabling
an interrupt on the IO_APIC level redirects the interrupt to the
legacy PIC interrupt lines.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 12:15:23 +01:00
Tom Zanussi
9f58a159d0 tracing/filters: disallow integer values for string filters and vice versa
Impact: fix filter use boundary condition / crash

Make sure filters for string fields don't use integer values and vice
versa.  Getting it wrong can crash the system or produce bogus
results.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237878882.8339.61.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 08:26:52 +01:00
Tom Zanussi
4bda2d517b tracing/filters: use trace_seq_printf() to print filters
Impact: cleanup

Instead of just using the trace_seq buffer to print the filters, use
trace_seq_printf() as it was intended to be used.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237878871.8339.59.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 08:26:52 +01:00
Tom Zanussi
09f1f245c7 tracing/filters: free pred when clearing filters
Impact: fix (small) per trace filter modification memory leak

Free the current pred when clearing the filters via the filter files.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237878851.8339.58.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 08:26:51 +01:00
Tom Zanussi
1fc2d5c119 tracing/filters: use list_for_each_entry
Impact: cleanup

No need to use the safe version here, so use list_for_each_entry instead
of list_for_each_entry_safe in find_event_field().

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237878841.8339.57.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-24 08:26:51 +01:00
Benjamin Herrenschmidt
9e41d9597e Merge commit 'origin/master' into next 2009-03-24 13:38:30 +11:00
James Morris
703a3cd728 Merge branch 'master' into next 2009-03-24 10:52:46 +11:00
Frederic Weisbecker
1618536961 tracing/function-graph-tracer: fix functions call traces imbalance
Impact: fix traces output

Sometimes one can observe an imbalance in the traces between function
calls and function return traces:

func1() {
    }
}

The curly brace inside func1() is the return of another function nested
inside func1. The return trace have been inserted in the buffer but not
the entry.
We are storing a return address on the function traces stack while we
haven't inserted its entry on the buffer, hence the imbalance on the
traces.

This is because the tracers doesn't check all failures that can happen
on buffer insertion.

This patch reports the tracing recursion failures and the ring buffer
failures. In such cases, we now restore the original return address for
the function, giving up its return trace.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1237843021-11695-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-23 23:25:32 +01:00
Anton Vorontsov
45b9560895 tracing: Fix TRACING_SUPPORT dependency for PPC32
commit 40ada30f96 ("tracing: clean up menu"),
despite the "clean up" in its purpose, introduced a behavioural
change for Kconfig symbols: we no longer able to select tracing
support on PPC32 (because IRQFLAGS_SUPPORT isn't yet implemented).

The IRQFLAGS_SUPPORT is not mandatory for most tracers, tracing core
has a special case for platforms w/o irqflags (which, by the way, has
become useless as of the commit above).

Though according to Ingo Molnar, there was periodic build failures on
weird, unmaintained architectures that had no irqflags-tracing support
and hence didn't know the raw_irqs_save/restore primitives. Thus we'd
better not enable irqflags-less tracing for all architectures.

This patch restores the old behaviour for PPC32, and thus brings the
tracing back. Other architectures can either add themselves to the
exception list or (better) implement TRACE_IRQFLAGS_SUPPORT.

Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Acked-b: Steven Rostedt <rostedt@goodmis.org>
Cc: linuxppc-dev@ozlabs.org
LKML-Reference: <20090323220724.GA9851@oksana.dev.rtsoft.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-23 23:23:03 +01:00
Thomas Gleixner
80c5520811 Merge branch 'cpus4096' into irq/threaded
Conflicts:
	arch/parisc/kernel/irq.c
	kernel/irq/handle.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-03-23 21:20:20 +01:00
Oleg Nesterov
37bebc70d7 posix timers: fix RLIMIT_CPU && fork()
See http://bugzilla.kernel.org/show_bug.cgi?id=12911

copy_signal() copies signal->rlim, but RLIMIT_CPU is "lost". Because
posix_cpu_timers_init_group() sets cputime_expires.prof_exp = 0 and thus
fastpath_timer_check() returns false unless we have other cpu timers.

This is the minimal fix for 2.6.29 (tested) and 2.6.28. The patch is not
optimal, we need further cleanups here. With this patch update_rlimit_cpu()
is not really needed, but I don't think it should be removed.

The proper fix (I think) is:

	- set_process_cpu_timer() should just start the cputimer->running
	  logic (it does), no need to change cputime_expires.xxx_exp

	- posix_cpu_timers_init_group() should set ->running when needed

	- fastpath_timer_check() can check ->running instead of
	  task_cputime_zero(signal->cputime_expires)

Reported-by: Peter Lojkin <ia6432@inbox.ru>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: <stable@kernel.org> [for 2.6.29.x]
LKML-Reference: <20090323193411.GA17514@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-23 20:43:35 +01:00
Miklos Szeredi
53da1d9456 fix ptrace slowness
This patch fixes bug #12208:

  Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=12208
  Subject         : uml is very slow on 2.6.28 host

This turned out to be not a scheduler regression, but an already
existing problem in ptrace being triggered by subtle scheduler
changes.

The problem is this:

 - task A is ptracing task B
 - task B stops on a trace event
 - task A is woken up and preempts task B
 - task A calls ptrace on task B, which does ptrace_check_attach()
 - this calls wait_task_inactive(), which sees that task B is still on the runq
 - task A goes to sleep for a jiffy
 - ...

Since UML does lots of the above sequences, those jiffies quickly add
up to make it slow as hell.

This patch solves this by not rescheduling in read_unlock() after
ptrace_stop() has woken up the tracer.

Thanks to Oleg Nesterov and Ingo Molnar for the feedback.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-23 09:22:31 -07:00
Ingo Molnar
efd247fa34 Merge branches 'sched/debug' and 'linus' into sched/core 2009-03-23 16:53:20 +01:00
Frederic Weisbecker
3e1f60b80c tracing/ftrace: check if debugfs is registered before creating files
Impact: fix a crash with ftrace={nop,boot} parameter

If the nop or initcall tracers are launched as boot tracers,
they will attempt to create their option directory and files.
But these tracers are registered very early and then assigned
as "boot tracers" very early if asked to.

Since they do this before debugfs has been registered (core initcall),
a crash is triggered.

Another early tracers could also come later. So we fix it by
checking if debugfs is initialized before creating the root
tracing directory.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1237759847-21025-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-23 16:25:47 +01:00
Ingo Molnar
b3e3b302cf Merge branches 'irq/sparseirq' and 'linus' into irq/core 2009-03-23 10:07:49 +01:00
Tom Zanussi
c4cff064be tracing/filters: clean up filter_add_subsystem_pred()
Impact: cleanup, memory leak fix

This patch cleans up filter_add_subsystem_pred():

- searches for the field before creating a copy of the pred

- fixes memory leak in the case a predicate isn't applied

- if -ENOMEM, makes sure there's no longer a reference to the
  pred so the caller can free the half-finished filter

- changes the confusing i == MAX_FILTER_PRED - 1 comparison
  previously remarked upon

This affects only per-subsystem event filtering.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237796808.7527.40.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-23 09:30:37 +01:00
Tom Zanussi
ee6cdabc82 tracing/filters: fix bug in copy_pred()
Impact: fix potential crash on subsystem filter expression freeing

When making a copy of the predicate, pred->field_name needs to be
duplicated in the copy as well, otherwise bad things can happen due to
later multiple frees of the same string.

This affects only per-subsystem event filtering.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237796802.7527.39.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-23 09:30:36 +01:00
Tom Zanussi
75c8b41752 tracing/filters: use list_for_each_entry_safe
Impact: cleanup

Use list_for_each_entry_safe instead of list_for_each_entry in
find_event_field().

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1237796788.7527.35.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-23 09:28:07 +01:00
Frederic Weisbecker
b118415bfa tracing/events: don't discard an event after commit
When we want to filter an event, the filter test is done after
the event is commited to the ring-buffer to be discarded later if
needed.

But a reader could be reading this event while we are trying to discard
it. Other kind of racy events can even happen because the event is
commited and can be read and/or consumed.

What we want is to discard the event before committing it.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <1237763919-21505-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-23 09:22:15 +01:00
Frederic Weisbecker
7e6ea92df3 tracing/ftrace: make nop-tracer use polling wait for events on pipe
Impact: display events when they arrive

Now that the events don't use wake_up() anymore, we need the nop
tracer to poll waiting for events on the pipe. Especially because
nop is useful to look at orphan traces types (traces types that
don't rely on specific tracers) because it doesn't produce traces
itself.

And unlike other tracers that trigger specific traces periodically,
nop triggers no traces by itself that can wake him.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1237759847-21025-5-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-23 09:22:15 +01:00
Frederic Weisbecker
07edf71213 tracing/events: don't use wake up for events
Impact: fix hard-lockup with sched switch events

Some ftrace events, such as sched wakeup, can be traced
while the runqueue lock is hold. Since they are using
trace_current_buffer_unlock_commit(), they call wake_up()
which can try to grab the runqueue lock too, resulting in
a deadlock.

Now for all event, we call a new helper:
trace_nowake_buffer_unlock_commit() which do pretty the same than
trace_current_buffer_unlock_commit() except than it doesn't call
trace_wake_up().

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1237759847-21025-4-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-23 09:22:14 +01:00
Frederic Weisbecker
9bd7d099ab tracing/events: make the filter files writable
We need the filter files to be writable, the current
filter file permissions are only set readable.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <1237759847-21025-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-23 09:22:14 +01:00
Ingo Molnar
fe9f57f250 tracing: add run-time field descriptions for event filtering, kfree fix
Impact: fix potential kfree of random data in (rare) failure path

Zero-initialize the field structure.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <1237710639.7703.46.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-22 18:43:25 +01:00
Tom Zanussi
cfb180f3e7 tracing: add per-subsystem filtering
This patch adds per-subsystem filtering to the event tracing subsystem.

It adds a 'filter' debugfs file to each subsystem directory.  This file
can be written to to set filters; reading from it will display the
current set of filters set for that subsystem.

Basically what it does is propagate the filter down to each event
contained in the subsystem.  If a particular event doesn't have a field
with the name specified in the filter, it simply doesn't get set for
that event.  You can verify whether or not the filter was set for a
particular event by looking at the filter file for that event.

As with per-event filters, compound expressions are supported, echoing
'0' to the subsystem's filter file clears all filters in the subsystem,
etc.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237710677.7703.49.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-22 18:38:47 +01:00
Tom Zanussi
7ce7e42499 tracing: add per-event filtering
This patch adds per-event filtering to the event tracing subsystem.

It adds a 'filter' debugfs file to each event directory.  This file can
be written to to set filters; reading from it will display the current
set of filters set for that event.

Basically, any field listed in the 'format' file for an event can be
filtered on (including strings, but not yet other array types) using
either matching ('==') or non-matching ('!=') 'predicates'.  A
'predicate' can be either a single expression:

 # echo pid != 0 > filter

 # cat filter
 pid != 0

or a compound expression of up to 8 sub-expressions combined using '&&'
or '||':

 # echo comm == Xorg > filter
 # echo "&& sig != 29" > filter

 # cat filter
 comm == Xorg
 && sig != 29

Only events having field values matching an expression will be available
in the trace output; non-matching events are discarded.

Note that a compound expression is built up by echoing each
sub-expression separately - it's not the most efficient way to do
things, but it keeps the parser simple and assumes that compound
expressions will be relatively uncommon.  In any case, a subsequent
patch introducing a way to set filters for entire subsystems should
mitigate any need to do this for lots of events.

Setting a filter without an '&&' or '||' clears the previous filter
completely and sets the filter to the new expression:

 # cat filter
 comm == Xorg
 && sig != 29

 # echo comm != Xorg

 # cat filter
 comm != Xorg

To clear a filter, echo 0 to the filter file:

 # echo 0 > filter
 # cat filter
 none

The limit of 8 predicates for a compound expression is arbitrary - for
efficiency, it's implemented as an array of pointers to predicates, and
8 seemed more than enough for any filter...

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237710665.7703.48.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-22 18:38:46 +01:00
Tom Zanussi
2d622719f1 tracing: add ring_buffer_event_discard() to ring buffer
This patch overloads RINGBUF_TYPE_PADDING to provide a way to discard
events from the ring buffer, for the event-filtering mechanism
introduced in a subsequent patch.

I did the initial version but thanks to Steven Rostedt for adding
the parts that actually made it work. ;-)

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-22 18:38:25 +01:00
Dmitri Vorobiev
b8b9426533 tracing: fix four sparse warnings
Impact: cleanup.

This patch fixes the following sparse warnings:

 kernel/trace/trace.c:385:9: warning: symbol 'trace_seq_to_buffer' was
 not declared. Should it be static?

 kernel/trace/trace_clock.c:29:13: warning: symbol 'trace_clock_local'
 was not declared. Should it be static?

 kernel/trace/trace_clock.c:54:13: warning: symbol 'trace_clock' was not
 declared. Should it be static?

 kernel/trace/trace_clock.c:74:13: warning: symbol 'trace_clock_global'
 was not declared. Should it be static?

Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com>
LKML-Reference: <1237741871-5827-4-git-send-email-dmitri.vorobiev@movial.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-22 18:16:54 +01:00
Dmitri Vorobiev
f80d2d7725 tracing, Text Edit Lock: Fix one sparse warning in kernel/extable.c
Impact: cleanup.

The global mutex text_mutex if declared in linux/memory.h, so
this file needs to be included into kernel/extable.c, where the
same mutex is defined. This fixes the following sparse warning:

 kernel/extable.c:32:1: warning: symbol 'text_mutex' was not declared.
 Should it be static?

Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com>
LKML-Reference: <1237741871-5827-3-git-send-email-dmitri.vorobiev@movial.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-22 18:16:20 +01:00
Tom Zanussi
cf027f645e tracing: add run-time field descriptions for event filtering
This patch makes the field descriptions defined for event tracing
available at run-time, for the event-filtering mechanism introduced
in a subsequent patch.

The common event fields are prepended with 'common_' in the format
display, allowing them to be distinguished from the other fields
that might internally have same name and can therefore be
unambiguously used in filters.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237710639.7703.46.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-22 18:11:22 +01:00
Ingo Molnar
a524446fe8 Merge branches 'tracing/ftrace', 'tracing/hw-breakpoints', 'tracing/ring-buffer', 'tracing/textedit' and 'linus' into tracing/core 2009-03-22 18:10:02 +01:00
Frederic Weisbecker
0cf53ff62b tracing: keep the tracing buffer after self-test failure
Instead of using ftrace_dump_on_oops, it's far more convenient
to have the trace leading up to a self-test failure available
in /debug/tracing/trace.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1237694675-23509-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-22 15:17:21 +01:00
Frederic Weisbecker
cf586b61f8 tracing/function-graph-tracer: prevent hangs during self-tests
Impact: detect tracing related hangs

Sometimes, with some configs, the function graph tracer can make
the timer interrupt too much slow, hanging the kernel in an endless
loop of timer interrupts servicing.

As suggested by Ingo, this patch brings a watchdog which stops the
selftest after a defined number of functions traced, definitely
disabling this tracer.

For those who want to debug the cause of the function graph trace
hang, you can pass the ftrace_dump_on_oops kernel parameter to dump
the traces after this hang detection.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1237694675-23509-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-22 14:06:40 +01:00
Li Zefan
b125130b22 blktrace: avoid accessing NULL bdev->bd_disk
bdev->bd_disk can be NULL, if the block device is not opened.

Try this against an unmounted partition, and you'll see NULL dereference:

  # echo 1 > /sys/block/sda/sda5/enable

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49C30098.6080107@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-21 16:17:24 +01:00
Li Zefan
cd649b8bb8 blktrace: remove sysfs_blk_trace_enable_show/store()
sysfs_blk_trace_enable_show()/store() share most of code with
sysfs_blk_trace_attr_show()/store().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49C30EA3.1060004@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-21 16:17:08 +01:00
Li Zefan
15152e448b blktrace: report EBUSY correctly
blk_trace_remove_queue() returns EINVAL if q->blk_trace == NULL,
but blk_trace_setup_queue() doesn't return EBUSY if
q->blk_trace != NULL.

 # echo 0 > sdaX/trace/enable
 # echo 0 > sdaX/trace/enable
 bash: echo: write error: Invalid argument
 # echo 1 > sdaX/trace/enable
 # echo 1 > sdaX/trace/enable
 (should return EBUSY)

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49C2F614.2010101@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-21 16:16:54 +01:00
Li Zefan
cbe28296eb blktrace: don't increase blk_probes_ref if failed to setup blk trace
do_blk_trace_setup() may return EBUSY, but the current code
doesn't decrease blk_probes_ref in this case.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49C2F5FF.80002@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-21 16:16:37 +01:00
Li Zefan
3c289ba7c3 blktrace: remove blk_probe_mutex
blk_register_tracepoints() always returns 0, so make it return void,
thus we don't need to use blk_probe_mutex to protect blk_probes_ref.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49C2F5EA.8060606@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-21 16:16:25 +01:00
Li Zefan
5006ea73f3 blktrace: make blk_tracer_enabled a bool flag
It doesn't have to be a counter, and it can be a bool flag instead.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <49C2F5D3.8090104@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-21 16:16:13 +01:00
Li Zefan
1a17662ea0 blktrace: fix possible memory leak
When we failed to create "block" debugfs dir, we should do some
cleanups.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <49C2F5B2.8000800@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-21 16:15:47 +01:00
Frederic Weisbecker
09c9e84d47 tracing/ring-buffer: don't annotate rb_cpu_notify with __cpuinit
Impact: remove a section warning

CONFIG_DEBUG_SECTION_MISMATCH raises the following warning on -tip:

  WARNING: kernel/trace/built-in.o(.text+0x5bc5): Section mismatch in
  reference from the function ring_buffer_alloc() to the function
  .cpuinit.text:rb_cpu_notify()
  The function ring_buffer_alloc() references
  the function __cpuinit rb_cpu_notify().

This is actually harmless. The code in the ring buffer don't build
rb_cpu_notify and other cpu hotplug stuffs when !CONFIG_HOTPLUG_CPU
so we have no risk to reference freed memory here (it would even
be harmless if we unconditionally build it because register_cpu_notifier
would do nothing when !CONFIG_HOTPLUG_CPU.

But since ring_buffer_alloc() can be called everytime, we don't want it
to be annotated with __cpuinit so we drop the __cpuinit from
rb_cpu_notify.

This is not a waste of memory because it is only defined and used on
CONFIG_HOTPLUG_CPU.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1237606416-22268-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-21 10:54:10 +01:00
Ingo Molnar
505f2b970b tracing, Text Edit Lock - kprobes architecture independent support, nommu fix
Impact: build fix on SH !CONFIG_MMU

Stephen Rothwell reported this linux-next build failure on the SH
architecture:

  kernel/built-in.o: In function `disable_all_kprobes':
  kernel/kprobes.c:1382: undefined reference to `text_mutex'
  [...]

And observed:

| Introduced by commit 4460fdad85 ("tracing,
| Text Edit Lock - kprobes architecture independent support") from the
| tracing tree.  text_mutex is defined in mm/memory.c which is only built
| if CONFIG_MMU is defined, which is not true for sh allmodconfig.

Move this lock to kernel/extable.c (which is already home to various
kernel text related routines), which file is always built-in.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
LKML-Reference: <20090320110602.86351a91.sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-20 11:09:27 +01:00
Peter Zijlstra
ac199db018 ftrace: event profile hooks
Impact: new tracing infrastructure feature

Provide infrastructure to generate software perf counter events
from tracepoints.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20090319194233.557364871@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-20 10:17:07 +01:00
Peter Zijlstra
28bea271e5 ftrace: ensure every event gets an id
Impact: widen user-space visibe event IDs to all events

Previously only TRACE_EVENT events got ids, because only they
generated raw output which needs to be demuxed from the trace.

In order to provide a unique ID for each event, register everybody,
regardless.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20090319194233.464914218@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-20 10:17:06 +01:00
Peter Zijlstra
23725aeeab ftrace: provide an id file for each event
Since not every event has a format file to read the id from,
expose it explicitly in a separate file.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20090319194233.372534033@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-20 10:17:05 +01:00
Ingo Molnar
44fc6ee923 Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-03-20 10:15:13 +01:00
Ingo Molnar
22de89b371 Merge branches 'tracing/ftrace', 'tracing/kprobes', 'tracing/tasks' and 'linus' into tracing/core 2009-03-20 10:14:53 +01:00
Steven Rostedt
5087f8d2a2 function-graph: show binary events as comments
With the added TRACE_EVENT macro, the events no longer appear in
the function graph tracer. This was because the function graph
did not know how to display the entries. The graph tracer was
only aware of its own entries and the printk entries.

By using the event call back feature, the graph tracer can now display
the events.

 # echo irq > /debug/tracing/set_event

Which can show:

 0)               |          handle_IRQ_event() {
 0)               |            /* irq_handler_entry: irq=48 handler=eth0 */
 0)               |            e1000_intr() {
 0)   0.926 us    |              __napi_schedule();
 0)   3.888 us    |            }
 0)               |            /* irq_handler_exit: irq=48 return=handled */
 0)   0.655 us    |            runqueue_is_locked();
 0)               |            __wake_up() {
 0)   0.831 us    |              _spin_lock_irqsave();

The irq entry and exit events show up as comments.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-19 15:58:56 -04:00
Steven Rostedt
40ce74f19c tracing: remove recording function depth from trace_printk
The function depth in trace_printk was to facilitate the function
graph output. Now that the function graph calculates the depth within
the trace output, we no longer need to record the depth when the
trace_printk is called.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-19 15:58:47 -04:00
Steven Rostedt
2fbcdb35ac function-graph: calculate function depth within function graph tracer
Currently, the function graph tracer depends on the trace_printk
to record the depth. All the information is already there in the trace
to calculate function depth, with the exception of having the printk
be the first item. But as soon as a entry or exit is reached, then
we know the depth.

This patch changes the iter->private data from recording a per cpu
last_pid, to a structure that holds both the last_pid and the current
depth. This data is used to determine the function depth for the
printks.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-19 15:58:38 -04:00
Steven Rostedt
5ef841f6f3 tracing: make print_(b)printk_msg_only global
This patch makes print_printk_msg_only and print_bprintk_msg_only
global for other functions to use. It also renames them by adding
a "trace_" to the beginning to avoid namespace collisions.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-19 15:57:55 -04:00
Frederic Weisbecker
3bf832ce1f tracing/ring-buffer: fix non cpu hotplug case
Impact: fix warning with irqsoff tracer

The ring buffer allocates its buffers on pre-smp time (early_initcall).
It means that, at first, only the boot cpu buffer is allocated and
the ring-buffer cpumask only has the boot cpu set (cpu_online_mask).

Later, the secondary cpu will show up and the ring-buffer will be notified
about this event: the appropriate buffer will be allocated and the cpumask
will be updated.

Unfortunately, if !CONFIG_CPU_HOTPLUG, the ring-buffer will not be
notified about the secondary cpus, meaning that the cpumask will have
only the cpu boot set, and only one cpu buffer allocated.

We fix that by using cpu_possible_mask if !CONFIG_CPU_HOTPLUG.

This patch fixes the following warning with irqsoff tracer running:

[  169.317794] WARNING: at kernel/trace/trace.c:466 update_max_tr_single+0xcc/0xf3()
[  169.318002] Hardware name: AMILO Li 2727
[  169.318002] Modules linked in:
[  169.318002] Pid: 5624, comm: bash Not tainted 2.6.29-rc8-tip-02636-g6aafa6c #11
[  169.318002] Call Trace:
[  169.318002]  [<ffffffff81036182>] warn_slowpath+0xea/0x13d
[  169.318002]  [<ffffffff8100b9d6>] ? ftrace_call+0x5/0x2b
[  169.318002]  [<ffffffff8100b9d6>] ? ftrace_call+0x5/0x2b
[  169.318002]  [<ffffffff8100b9d1>] ? ftrace_call+0x0/0x2b
[  169.318002]  [<ffffffff8101ef10>] ? ftrace_modify_code+0xa9/0x108
[  169.318002]  [<ffffffff8106e27f>] ? trace_hardirqs_off+0x25/0x27
[  169.318002]  [<ffffffff8149afe7>] ? _spin_unlock_irqrestore+0x1f/0x2d
[  169.318002]  [<ffffffff81064f52>] ? ring_buffer_reset_cpu+0xf6/0xfb
[  169.318002]  [<ffffffff8106637c>] ? ring_buffer_reset+0x36/0x48
[  169.318002]  [<ffffffff8106aeda>] update_max_tr_single+0xcc/0xf3
[  169.318002]  [<ffffffff8100bc17>] ? sysret_check+0x22/0x5d
[  169.318002]  [<ffffffff8106e3ea>] stop_critical_timing+0x142/0x204
[  169.318002]  [<ffffffff8106e4cf>] trace_hardirqs_on_caller+0x23/0x25
[  169.318002]  [<ffffffff8149ac28>] trace_hardirqs_on_thunk+0x3a/0x3c
[  169.318002]  [<ffffffff8100bc17>] ? sysret_check+0x22/0x5d
[  169.318002] ---[ end trace db76cbf775a750cf ]---

Because this tracer may try to swap two cpu ring buffers for an
unregistered cpu on the ring buffer.

This patch might also fix a fair loss of traces due to unallocated buffers
for secondary cpus.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-b: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1237470453-5427-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-19 16:41:08 +01:00
Steven Rostedt
ac5f6c9685 function-graph: consolidate prologues for output
Impact: clean up

The prologue of the function graph entry, return and comments all
start out pretty much the same. Each of these duplicate code and
do so slightly differently.

This patch consolidates the printing of the pid, absolute time,
cpu and proc (and for entry, the interrupt).

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-19 11:29:23 -04:00
Rusty Russell
8c083f081d cpumask: remove cpumask allocation from idle_balance, fix
Impact: fix boot crash

Fix typo in the size calculation.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <alpine.DEB.2.00.0903181729360.31583@gandalf.stny.rr.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-19 13:49:44 +01:00
Ingo Molnar
4a44bac1f9 symbols, stacktrace: look up init symbols after module symbols
Impact: fix incomplete stacktraces

I noticed such weird stacktrace entries in lockdep dumps:

[    0.285956] {HARDIRQ-ON-W} state was registered at:
[    0.285956]   [<ffffffff802bce90>] mark_irqflags+0xbe/0x125
[    0.285956]   [<ffffffff802bf2fd>] __lock_acquire+0x674/0x82d
[    0.285956]   [<ffffffff802bf5b2>] lock_acquire+0xfc/0x128
[    0.285956]   [<ffffffff8135b636>] rt_spin_lock+0xc8/0xd0
[    0.285956]   [<ffffffffffffffff>] 0xffffffffffffffff

The stacktrace entry is cut off after rt_spin_lock.

After much debugging i found out that stacktrace entries that
belong to init symbols dont get printed out, due to commit:

  a2da405: module: Don't report discarded init pages as kernel text.

The reason is this check added to core_kernel_text():

-       if (addr >= (unsigned long)_sinittext &&
+       if (system_state == SYSTEM_BOOTING &&
+           addr >= (unsigned long)_sinittext &&
            addr <= (unsigned long)_einittext)
                return 1;

This will discard inittext symbols even though their symbol table
is still present and even though stacktraces done while the system
was booting up might still be relevant.

To not reintroduce the (not well-specified) bug addressed in that
commit, first do a module symbols lookup, then a final init-symbols
lookup.

This will work fine on architectures that have separate address
spaces for modules (such as x86) - and should not crash any other
architectures either.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <new-discussion>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-19 13:38:35 +01:00
Rusty Russell
df7c8e845e cpumask: remove cpumask allocation from idle_balance
Impact: fix circular locking

Steven reports a circular locking from alloc_cpumask_var doing
a wakeup. We get rid of this using the tried-and-true technique
of using a per-cpu cpumask_var_t rather than doing an alloc
every time.

Simpler and more robust than a rare, implicit allocation within
an atomic codepath.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <alpine.DEB.2.00.0903181729360.31583@gandalf.stny.rr.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-19 08:15:15 +01:00
Ingo Molnar
ec625cb29e tracepoints: dont update zero-sized tracepoint sections
Zero-sized tracepoint sections can occur if tracing is enabled but
no tracepoint is defined. Do not emit a warning in that case.

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
LKML-Reference: <1237394936.3132.1.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-18 19:55:00 +01:00
Jaswinder Singh Rajput
09933a108e tracing: fix oops in tracepoint_update_probe_range()
Change this crash:

 BUG: unable to handle kernel NULL pointer dereference at (null)
 IP: [<ffffffff8107d4de>] tracepoint_update_probe_range+0x1f/0x9b
 PGD 13d5fb067 PUD 13d688067 PMD 0
 Oops: 0000 [#1] SMP

To a more debuggable WARN_ONCE().

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1237394936.3132.1.camel@localhost.localdomain>
[ moved the check outside the lock and added a WARN_ON(). ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-18 18:54:39 +01:00
Steven Rostedt
4acd4d00f7 tracing: give easy way to clear trace buffer
There is currently no easy way to clear the trace buffer. Currently
the only way is to change the current tracer.

This patch lets the user clear the trace buffer by simply writing
into the trace files.

 echo > /debug/tracing/trace

or to clear a single cpu (i.e. for CPU 1):

 echo > /debug/tracing/per_cpu/cpu1/trace

Requested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-18 10:52:47 -04:00
Ananth N Mavinakayanahalli
f02b8624fe kprobes: Fix locking imbalance in kretprobes
Fix locking imbalance in kretprobes:

=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
kthreadd/2 is trying to release lock (&rp->lock) at:
[<c06b3080>] pre_handler_kretprobe+0xea/0xf4
but there are no more locks to release!

other info that might help us debug this:
1 lock held by kthreadd/2:
 #0:  (rcu_read_lock){..--}, at: [<c06b2b24>] __atomic_notifier_call_chain+0x0/0x5a

stack backtrace:
Pid: 2, comm: kthreadd Not tainted 2.6.29-rc8 #1
Call Trace:
 [<c06ae498>] ? printk+0xf/0x17
 [<c06b3080>] ? pre_handler_kretprobe+0xea/0xf4
 [<c044ce6c>] print_unlock_inbalance_bug+0xc3/0xce
 [<c0444d4b>] ? clocksource_read+0x7/0xa
 [<c04450a4>] ? getnstimeofday+0x5f/0xf6
 [<c044a9ca>] ? register_lock_class+0x17/0x293
 [<c044b72c>] ? mark_lock+0x1e/0x30b
 [<c0448956>] ? tick_dev_program_event+0x4a/0xbc
 [<c0498100>] ? __slab_alloc+0xa5/0x415
 [<c06b2fbe>] ? pre_handler_kretprobe+0x28/0xf4
 [<c06b3080>] ? pre_handler_kretprobe+0xea/0xf4
 [<c044cf1b>] lock_release_non_nested+0xa4/0x1a5
 [<c06b3080>] ? pre_handler_kretprobe+0xea/0xf4
 [<c044d15d>] lock_release+0x141/0x166
 [<c06b07dd>] _spin_unlock_irqrestore+0x19/0x50
 [<c06b3080>] pre_handler_kretprobe+0xea/0xf4
 [<c06b20b5>] kprobe_exceptions_notify+0x1c9/0x43e
 [<c06b2b02>] notifier_call_chain+0x26/0x48
 [<c06b2b5b>] __atomic_notifier_call_chain+0x37/0x5a
 [<c06b2b24>] ? __atomic_notifier_call_chain+0x0/0x5a
 [<c06b2b8a>] atomic_notifier_call_chain+0xc/0xe
 [<c0442d0d>] notify_die+0x2d/0x2f
 [<c06b0f9c>] do_int3+0x1f/0x71
 [<c06b0e84>] int3+0x2c/0x34
 [<c042d476>] ? do_fork+0x1/0x288
 [<c040221b>] ? kernel_thread+0x71/0x79
 [<c043ed1b>] ? kthread+0x0/0x60
 [<c043ed1b>] ? kthread+0x0/0x60
 [<c04040b8>] ? kernel_thread_helper+0x0/0x10
 [<c043ec7f>] kthreadd+0xac/0x148
 [<c043ebd3>] ? kthreadd+0x0/0x148
 [<c04040bf>] kernel_thread_helper+0x7/0x10

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org> [2.6.29.x, 2.6.28.x, 2.6.27.x]
LKML-Reference: <20090318113621.GB4129@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-18 12:51:16 +01:00
Frederic Weisbecker
4903620034 tracing/ftrace: stop {irqs, preempt}soff tracers when tracing is stopped
Impact: fix a selftest warning

In some cases, it's possible to see the following warning on irqsoff
tracer selftest:

[    4.640003] Testing tracer irqsoff: <4>------------[ cut here ]------------
[    4.653562] WARNING: at kernel/trace/trace.c:458 update_max_tr_single+0x9a/0xc4()
[    4.660000] Hardware name: System Product Name
[    4.660000] Modules linked in:
[    4.660000] Pid: 301, comm: kstop/1 Not tainted 2.6.29-rc8-tip #35837
[    4.660000] Call Trace:
[    4.660000]  [<4014b588>] warn_slowpath+0x79/0x8f
[    4.660000]  [<402d6949>] ? put_dec+0x64/0x6b
[    4.660000]  [<40162b56>] ? getnstimeofday+0x58/0xdd
[    4.660000]  [<40162210>] ? clocksource_read+0x3/0xf
[    4.660000]  [<4015eb44>] ? ktime_set+0x8/0x34
[    4.660000]  [<4014101a>] ? balance_runtime+0x8/0x56
[    4.660000]  [<405f6f11>] ? _spin_lock+0x3/0x10
[    4.660000]  [<4011f643>] ? ftrace_call+0x5/0x8
[    4.660000]  [<4015d0f1>] ? task_cputime_zero+0x3/0x27
[    4.660000]  [<40190ee7>] ? cpupri_set+0x90/0xcb
[    4.660000]  [<405f7208>] ? _spin_lock_irqsave+0x22/0x34
[    4.660000]  [<40190f12>] ? cpupri_set+0xbb/0xcb
[    4.660000]  [<405f7151>] ? _spin_unlock_irqrestore+0x23/0x35
[    4.660000]  [<4018493f>] ? ring_buffer_reset_cpu+0x27/0x51
[    4.660000]  [<405f7208>] ? _spin_lock_irqsave+0x22/0x34
[    4.660000]  [<40184962>] ? ring_buffer_reset_cpu+0x4a/0x51
[    4.660000]  [<405f7151>] ? _spin_unlock_irqrestore+0x23/0x35
[    4.660000]  [<4018cc29>] ? trace_hardirqs_off+0x1a/0x1c
[    4.660000]  [<405f7151>] ? _spin_unlock_irqrestore+0x23/0x35
[    4.660000]  [<40184962>] ? ring_buffer_reset_cpu+0x4a/0x51
[    4.660000]  [<401850f3>] ? cpumask_next+0x15/0x18
[    4.660000]  [<4018a41f>] update_max_tr_single+0x9a/0xc4
[    4.660000]  [<4014e5fe>] ? exit_notify+0x16/0xf2
[    4.660000]  [<4018cd13>] check_critical_timing+0xcc/0x11e
[    4.660000]  [<4014e5fe>] ? exit_notify+0x16/0xf2
[    4.660000]  [<4014e5fe>] ? exit_notify+0x16/0xf2
[    4.660000]  [<4018cdf1>] stop_critical_timing+0x8c/0x9f
[    4.660000]  [<4014e5c4>] ? forget_original_parent+0xac/0xd0
[    4.660000]  [<4018ce3a>] trace_hardirqs_on+0x1a/0x1c
[    4.660000]  [<4014e5c4>] forget_original_parent+0xac/0xd0
[    4.660000]  [<4014e5fe>] exit_notify+0x16/0xf2
[    4.660000]  [<4014e8a5>] do_exit+0x1cb/0x225
[    4.660000]  [<4015c72b>] ? kthread+0x0/0x69
[    4.660000]  [<4011f61d>] kernel_thread_helper+0xd/0x10
[    4.660000] ---[ end trace a7919e7f17c0a725 ]---
[    4.660164] .. no entries found ..FAILED!

During the selftest of irqsoff tracer, we do that:

	/* disable interrupts for a bit */
	local_irq_disable();
	udelay(100);
	local_irq_enable();
	/* stop the tracing. */
	tracing_stop();
	/* check both trace buffers */
	ret = trace_test_buffer(tr, NULL);

If a callsite performs a new max delay with irqs off just after
tracing_stop, update_max_tr_single() -> ring_buffer_swap_cpu()
will be called with the buffers disabled by tracing_stop(), hence
the warning, then ring_buffer_swap_cpu() return -EAGAIN and
update_max_tr_single() complains.

Fix it by also stopping the tracer before stopping the tracing globally.
A similar situation can happen with preemptoff and preemptirqsoff tracers
where we apply the same fix.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1237325938-5240-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-18 10:12:38 +01:00
Carsten Emde
a635cf0497 tracing: fix command line to pid reverse map
Impact: fix command line to pid mapping

map_cmdline_to_pid[] is checked in trace_save_cmdline(), but never
updated. This results in stale pid to command line mappings and the
tracer output will associate the wrong comm string.

Signed-off-by: Carsten Emde <Carsten.Emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-18 10:10:18 +01:00
Thomas Gleixner
50d88758a3 tracing: fix trace_find_cmdline()
Impact: prevent stale command line output

In case there is no valid command line mapping for a pid
trace_find_cmdline() returns without updating the comm buffer. The
trace dump keeps the previous entry which results in confusing trace
output:

     <idle>-0     [000]   280.702056 ....
     <idle>-23456 [000]   280.702080 ....

Update the comm buffer with "<...>" when no mapping is found.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-18 10:10:17 +01:00
Thomas Gleixner
2c7eea4c62 tracing: replace the crude (unsigned) -1 hackery
Impact: cleanup

The command line recorder uses (unsigned) -1 to mark non mapped
entries in the pid to command line maps. The validity check is
completely unintuitive: idx >= SAVED_CMDLINES

There is no need for such casting games. Use a constant to mark
unmapped entries and check for that constant to make the code readable
and understandable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-18 10:10:17 +01:00
Thomas Gleixner
18aecd362a tracing: stop command line recording when tracing is disabled
Impact: prevent overwrite of command line entries

When the tracer is stopped the command line recording continues to
record. The check for tracing_is_on() is not sufficient here as the
ringbuffer status is not affected by setting
debug/tracing/tracing_enabled to 0. On a non idle system this can
result in the loss of the command line information for the stopped
trace, which makes the trace harder to read and analyse.

Check tracer_enabled to allow further recording.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-18 10:10:16 +01:00
Luis Henriques
af66df5ecf sched: jiffies not printed per CPU
The jiffies value was being printed for each CPU, which does not seem to make
sense.  Moved jiffies to system section.

Signed-off-by: Luis Henriques <henrix@sapo.pt>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20090318000425.GA2228@hades.domain.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-18 09:57:26 +01:00
Ingo Molnar
327019b01e Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-03-18 06:59:56 +01:00
Steven Rostedt
62524d55e5 tracing: make power tracer start/stop methods lighter weight
The start/stop methods of a tracer should be able to be executed
in all contexts. This patch converts the power tracer to do so.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-17 23:12:11 -04:00
Steven Rostedt
5fec6ddcb4 tracing: make sched_switch stop/start light weight
The stopping and starting of a tracer should be light weight and
be able to be called in all contexts. The sched_switch grabbed
mutexes in the start/stop functions. This patch changes it to a
simple variable, on/off.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-17 23:10:45 -04:00
Steven Rostedt
af4617bdba tracing: add global-clock option to provide cross CPU clock to traces
Impact: feature to allow better serialized clock

This patch adds an option called "global-clock" that will allow
the tracer to switch to a slower but more accurate (across CPUs)
clock.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-17 23:10:35 -04:00
Steven Rostedt
37886f6a9f ring-buffer: add api to allow a tracer to change clock source
This patch adds a new function called ring_buffer_set_clock that
allows a tracer to assign its own clock source to the buffer.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-17 23:06:31 -04:00
Masami Hiramatsu
6e2b75740b module: fix refptr allocation and release order
Impact: fix ref-after-free crash on failed module load

Fix refptr bug: Change refptr allocation and release order not to access a module
data structure pointed by 'mod' after freeing mod->module_core.
This bug will cause kernel panic(e.g. failed to find undefined symbols).

This bug was reported on systemtap bugzilla.
http://sources.redhat.com/bugzilla/show_bug.cgi?id=9927

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-18 09:31:21 +10:30
Guillaume Knispel
f2d28a2ebc printk: correct the behavior of printk_timed_ratelimit()
Impact: fix jiffies-comparison sign-wrap behavior

The behavior provided by printk_timed_ratelimit() is, in some
situations, probably not what a caller would reasonably expect:

bool printk_timed_ratelimit(unsigned long *caller_jiffies,
			unsigned int interval_msecs)
{
	if (*caller_jiffies == 0 || time_after(jiffies, *caller_jiffies)) {
		*caller_jiffies = jiffies + msecs_to_jiffies(interval_msecs);
		return true;
	}
	return false;
}

On a 32 bit computer, if printk_timed_ratelimit() is initially called at
time jiffies == Ja, *caller_jiffies is set to
Ja + msecs_to_jiffies(interval_msecs): let's say Ja + 42 for this
example.

If this caller then doesn't call printk_timed_ratelimit() until
jiffies == Ja + (1 << 31) + 42 (which can happen as soon as ~ 25 days
later on a 1000 HZ system), printk_timed_ratelimit() will then always
return false to this caller until jiffies loops completely (1 << 31 more
ticks).

Ths change makes it only return false if jiffies is in the small
time window starting at the previous call when true was returned and
ending interval_msecs later.  Note that if jiffies loops completely
between two calls to printk_timed_ratelimit(), it will obviously still
wrongly return false, but this is something with a low probability.

If something completely reliable is needed I guess jiffies_64 must be
used (which this change does not do).

Signed-off-by: Guillaume Knispel <gknispel@proformatique.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <20090317161842.0059096b@xilun.lan.proformatique.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-17 16:25:28 +01:00
Luis Henriques
708dc51253 sched: small optimisation of can_migrate_task()
There were 3 invocations of task_hot() in can_migrate_task().

Replace these 3 invocations by only one invocation, cached in
a local variable.

Signed-off-by: Luis Henriques <henrix@sapo.pt>
LKML-Reference: <20090316195902.GA6197@hades.domain.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-17 12:04:41 +01:00
Luis Henriques
80dd99b368 sched: fix typos in documentation
Fixed typos in function documentation.

Signed-off-by: Luis Henriques <henrix@sapo.pt>
LKML-Reference: <20090316195809.GA6073@hades.domain.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-17 12:04:40 +01:00
Ingo Molnar
4176935b58 Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-03-17 10:37:37 +01:00
Tom Zanussi
c269fc8c53 tracing: fix leak in event_format_read()
Impact: fix memory leak

If event_format_read() exits early due to nonzero ppos, the
previous kmalloc doesn't get freed - might as well do the
check before the kmalloc and avoid the problem.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: =?ISO-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237270859.8033.141.camel@charm-linux>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-17 08:38:23 +01:00
Steven Rostedt
6adaad14d7 tracing: stop comm recording on tracing off
Impact: fix for losing comms in trace

The command lines of tasks are cached at sched switch to not need
to record them at every trace point.  Disabling the tracing on stops
the recording of traces, but does not stop the caching of command lines.
When the tracing is off the cache may overflow and cause the tracing
to show incorrect tasks matching the PIDs.

This patch disables prevents updates to the comm cache when the ring buffer
is off.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-16 23:31:15 -04:00
Steven Rostedt
4ca5308523 tracing: protect reader of cmdline output
Impact: fix to one cause of incorrect comm outputs in trace

The spinlock only protected the creation of a comm <=> pid pair.
But it was possible that a reader could look up a pid, and get the
wrong comm because it had no locking.

This also required changing trace_find_cmdline to copy the comm cache
and not just send back a pointer to it.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-16 23:27:06 -04:00
Frederic Weisbecker
03303549b1 tracing/ftrace: fix the check on nopped sites
Impact: fix a dynamic tracing failure

Recently, the function and function graph tracers failed to use dynamic
tracing after the following commit:

fa9d13cf13
(ftrace: don't try to __ftrace_replace_code on !FTRACE_FL_CONVERTED rec)

The patch is right except a mistake on the check for the FTRACE_FL_CONVERTED
flag. The code patching is aborted in case of successfully nopped sites.
What we want is the opposite: ignore the callsites that haven't been nopped.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-16 22:15:36 -04:00
Ingo Molnar
edb35028e4 Merge branches 'irq/genirq' and 'linus' into irq/core 2009-03-16 09:20:13 +01:00
Frederic Weisbecker
2fc1dfbe17 tracing/core: fix early free of cpumasks
Impact: fix crashes when tracing cpumasks

While ring-buffer allocation, the cpumasks are allocated too,
including the tracing cpumask and the per-cpu file mask handler.
But these cpumasks are freed accidentally just after.
Fix it.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237164303-11476-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-16 09:13:18 +01:00
Frederic Weisbecker
ac1d52d0b8 tracing/ftrace: fix double calls to tracing_start()
Impact: fix a warning during preemptirqsoff selftests

When the preemptirqsoff selftest fails, we see the following
warning:

[    6.050000] Testing tracer preemptirqsoff: .. no entries found ..
------------[ cut here ]------------
[    6.060000] WARNING: at kernel/trace/trace.c:688 tracing_start+0x67/0xd3()
[    6.060000] Modules linked in:
[    6.060000] Pid: 1, comm: swapper Tainted: G
[    6.060000] Call Trace:
[    6.060000]  [<ffffffff802460ff>] warn_slowpath+0xb1/0x100
[    6.060000]  [<ffffffff802a8f5b>] ? trace_preempt_on+0x35/0x4b
[    6.060000]  [<ffffffff802a37fb>] ? tracing_start+0x31/0xd3
[    6.060000]  [<ffffffff802a37fb>] ? tracing_start+0x31/0xd3
[    6.060000]  [<ffffffff80271e0b>] ? __lock_acquired+0xe6/0x1f2
[    6.060000]  [<ffffffff802a37fb>] ? tracing_start+0x31/0xd3
[    6.060000]  [<ffffffff802a3831>] tracing_start+0x67/0xd3
[    6.060000]  [<ffffffff802a8ace>] ? irqsoff_tracer_reset+0x2d/0x57
[    6.060000]  [<ffffffff802a4d1c>] trace_selftest_startup_preemptirqsoff+0x1c8/0x1f1
[    6.060000]  [<ffffffff802a4798>] register_tracer+0x12f/0x241
[    6.060000]  [<ffffffff810250d0>] ? init_irqsoff_tracer+0x0/0x53
[    6.060000]  [<ffffffff8102510b>] init_irqsoff_tracer+0x3b/0x53

This is because in fail case, the preemptirqsoff tracer selftest calls twice
the tracing_start() function:

int
trace_selftest_startup_preemptirqsoff(struct tracer *trace, struct trace_array *tr)
{
        if (!ret && !count) {
                printk(KERN_CONT ".. no entries found ..");
                ret = -1;
                tracing_start(); <-----
                goto out;
        }
        [...]
out:
        trace->reset(tr);
        tracing_start(); <------
        tracing_max_latency = save_max;

        return ret;
}

Since it is well handled in the out path, we don't need the conditional one.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237159961-7447-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-16 09:13:18 +01:00
Frederic Weisbecker
59f586db98 tracing/core: fix missing mutex unlock on tracing_set_tracer()
Impact: fix possible locking imbalance

In case of ring buffer resize failure, tracing_set_tracer forgot to
release trace_types_lock. Fix it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237151439-6755-5-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-16 09:13:17 +01:00
Frederic Weisbecker
0ea1c4156b tracing/syscalls: select kallsysms
Syscall tracing must select kallsysms.

The arch code builds a table to find the syscall metadata by syscall
number. It needs the syscalls names resolution from the symbol table
to know which name found on the syscalls metadatas match a function
pointer from the arch sys_call_table.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237151439-6755-4-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-16 09:13:17 +01:00
Frederic Weisbecker
5be71b61f1 tracing/syscalls: protect thread flag toggling from races
Impact: fix syscall tracer enable/disable race

The current thread flag toggling is racy as shown in the following
scenario:

- task A is the last user of syscall tracing, it releases the
  TIF_SYSCALL_FTRACE on each tasks

- at the same time task B start syscall tracing. refcount == 0 so
  it sets up TIF_SYSCALL_FTRACE on each tasks.

The effect of the mixup is unpredictable.
So this fix adds a mutex on {start,stop}_syscall_tracing().

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1237151439-6755-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-16 09:13:16 +01:00
Frederic Weisbecker
6404434525 tracing/syscalls: various cleanups
Impact: cleanup

- Drop unused cpu variable
- Fix some errors on comments

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237151439-6755-2-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-16 09:13:16 +01:00
Frederic Weisbecker
ac99c58c9e tracing/syscalls: fix missing release of tracing
Impact: fix 'stuck' syscall tracer

The syscall tracer uses a refcounter to enable several users
simultaneously.

But the refcounter did not behave correctly and always restored
its value to 0 after calling start_syscall_tracing(). Therefore,
stop_syscall_tracing() couldn't release correctly the tasks from
tracing.

Also the tracer forgot to reset the buffer when it is released.

Drop the pointless refcount decrement on start_syscall_tracing()
and reset the buffer when we release the tracer.

This fixes two reported issue:

- when we switch from syscall tracer to another tracer, syscall
  tracing continued.

- incorrect use of the refcount.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1237151439-6755-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-16 09:13:15 +01:00
Ingo Molnar
7243f2145a Merge branches 'tracing/ftrace', 'tracing/syscalls' and 'linus' into tracing/core
Conflicts:
	arch/parisc/kernel/irq.c
2009-03-16 09:12:42 +01:00
Frederic Weisbecker
bed1ffca02 tracing/syscalls: core infrastructure for syscalls tracing, enhancements
Impact: new feature

This adds the generic support for syscalls tracing. This is
currently exploited through a devoted tracer but other tracing
engines can use it. (They just have to play with
{start,stop}_ftrace_syscalls() and use the display callbacks
unless they want to override them.)

The syscalls prototypes definitions are abused here to steal
some metadata informations:

- syscall name, param types, param names, number of params

The syscall addr is not directly saved during this definition
because we don't know if its prototype is available in the
namespace. But we don't really need it. The arch has just to
build a function able to resolve the syscall number to its
metadata struct.

The current tracer prints the syscall names, parameters names
and values (and their types optionally). Currently the value is
a raw hex but higher level values diplaying is on my TODO list.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1236955332-10133-2-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 16:57:42 +01:00
Thomas Gleixner
0e57aa11ab genirq: deprecate __do_IRQ
Two years migration time is enough. Remove the compability cruft.

Add the deprecated warning in kernel/irq/handle.c because marking
__do_IRQ itself is way too noisy.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-03-13 16:34:02 +01:00
Thomas Gleixner
4553573277 genirq: use kzalloc instead of explicit zero initialization
Impact: simplification

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2009-03-13 14:32:29 +01:00
Thomas Gleixner
c8e2aeef0b genirq: remove redundant if condition
Impact: cleanup

The code is only compiled if CONFIG_GENERIC_HARDIRQS=y so another
check for this define in the code is redundant. Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-03-13 14:32:28 +01:00
Ingo Molnar
e9a22d1fb9 x86, bts: cleanups
Impact: cleanup, no code changed

Cc: Markus Metzger <markus.t.metzger@intel.com>
LKML-Reference: <20090313104218.A30096@sedona.ch.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 11:57:22 +01:00
Markus Metzger
321bb5e1ac x86, hw-branch-tracer: add selftest
Add a selftest for the hw-branch-tracer.

Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
LKML-Reference: <20090313105027.A30183@sedona.ch.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 11:57:21 +01:00
Markus Metzger
ba9372a8f3 x86, hw-branch-tracer: keep resources on stop
Distinguish init/reset and start/stop:

init/reset will allocate and release bts tracing resources
stop/start will suspend and resume bts tracing

Return an error on init() if no cpu can be traced.

Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
LKML-Reference: <20090313104852.A30168@sedona.ch.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 11:57:21 +01:00
Lai Jiangshan
e94142a67f ftrace: remove struct list_head from struct dyn_ftrace
Impact: save memory

The struct dyn_ftrace table is very large, this patch will save
about 50%.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <49BA2C9F.8020009@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 11:36:20 +01:00
Lai Jiangshan
850a80cfaa ftrace: use seq_read
Impact: cleanup

VFS layer has tested the file mode, we do not need test it.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <49BA2BAB.6010608@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 11:35:31 +01:00
Ingo Molnar
c95dbf27e2 panic: clean up kernel/panic.c
Impact: cleanup, no code changed

Clean up kernel/panic.c some more and make it more consistent.

LKML-Reference: <49B91A7E.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 11:25:53 +01:00
Ingo Molnar
d1dedb52ac panic, smp: provide smp_send_stop() wrapper on UP too
Impact: cleanup, no code changed

Remove an ugly #ifdef CONFIG_SMP from panic(), by providing
an smp_send_stop() wrapper on UP too.

LKML-Reference: <49B91A7E.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 11:24:31 +01:00
Ingo Molnar
ffd71da4e3 panic: decrease oops_in_progress only after having done the panic
Impact: eliminate secondary warnings during panic()

We can panic() in a number of difficult, atomic contexts, hence
we use bust_spinlocks(1) in panic() to increase oops_in_progress,
which prevents various debug checks we have in place.

But in practice this protection only covers the first few printk's
done by panic() - it does not cover the later attempt to stop all
other CPUs and kexec(). If a secondary warning triggers in one of
those facilities that can make the panic message scroll off.

So do bust_spinlocks(0) only much later in panic(). (which code
is only reached if panic policy is relaxed that it can return
after a warning message)

Reported-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B91A7E.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 11:06:47 +01:00
Ingo Molnar
cd80a8142e Merge branch 'x86/core' into core/ipi 2009-03-13 11:05:58 +01:00
Ingo Molnar
641cd4cfcd generic-ipi: eliminate WARN_ON()s during oops/panic
Do not output smp-call related warnings in the oops/panic codepath.

Reported-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <49B91A7E.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 10:47:34 +01:00
Ingo Molnar
88f502fedb futex: remove the pointer math from double_unlock_hb, fix
Impact: fix double unlock crash

Thomas Gleixner noticed that the simplified double_unlock_hb()
became ... too unsophisticated: in the hb1 == hb2 case it will
do a double unlock.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Darren Hart <dvhltc@us.ibm.com>
LKML-Reference: <20090312221118.11146.68610.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 10:32:07 +01:00
Zhaolei
fa9d13cf13 ftrace: don't try to __ftrace_replace_code on !FTRACE_FL_CONVERTED rec
Do __ftrace_replace_code for !FTRACE_FL_CONVERTED rec will always
fail, we should ignore this rec.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: "Steven Rostedt ;" <rostedt@goodmis.org>
LKML-Reference: <49BA2472.4060206@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 10:25:06 +01:00
Zhaolei
b00f0b6dc1 ftrace: avoid double-free of dyn_ftrace
If dyn_ftrace is freed before ftrace_release(), ftrace_release()
will free it again and make ftrace_free_records wrong.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: "Steven Rostedt ;" <rostedt@goodmis.org>
LKML-Reference: <49BA23D9.1050900@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 10:25:06 +01:00
Ingo Molnar
62a394eb77 Merge branches 'tracing/ftrace' and 'tracing/syscalls'; commit 'v2.6.29-rc8' into tracing/core 2009-03-13 10:23:39 +01:00
Frederic Weisbecker
ee08c6eccb tracing/ftrace: syscall tracing infrastructure, basics
Provide basic callbacks to do syscall tracing.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1236401580-5758-2-git-send-email-fweisbec@gmail.com>
[ simplified it to a trace_printk() for now. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 06:25:43 +01:00
Steven Rostedt
899039e874 softirq: no need to have SOFTIRQ in softirq name
Impact: clean up

It is redundant to have 'SOFTIRQ' in the softirq names.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-13 00:43:33 -04:00
Steven Rostedt
7f96f93f02 tracing: move binary buffers into per cpu directory
The binary_buffers directory in /debugfs/tracing held the files
to read the trace buffers in a binary format. This held one file
per CPU buffer. But we also have a per_cpu directory that holds
a way to read the pretty-print formats.

This patch moves the binary buffers into the per_cpu_directory:

 # ls /debug/tracing/per_cpu/cpu1/
trace  trace_pipe  trace_pipe_raw

The new name is called "trace_pipe_raw". The binary buffers always
acted similar to trace_pipe, except that they produce raw data.

Requested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-13 00:37:42 -04:00
Rusty Russell
c69fc56de1 cpumask: use topology_core_cpumask/topology_thread_cpumask instead of cpu_core_map/cpu_sibling_map
Impact: cleanup

This is presumably what those definitions are for, and while all archs
define cpu_core_map/cpu_sibling map, that's changing (eg. x86 wants to
change it to a pointer).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-13 14:49:46 +10:30
Steven Rostedt
bdc067582b tracing: add comment for use of double __builtin_consant_p
Impact: documentation

The use of the double __builtin_contant_p checks in the event_trace_printk
can be confusing to developers and reviewers. This patch adds a comment
to explain why it is there.

Requested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
LKML-Reference: <20090313122235.43EB.A69D9226@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-13 00:15:46 -04:00
Steven Rostedt
eb1871f343 tracing: left align location header in stack_trace
Ingo Molnar suggested, instead of:

        Depth    Size      Location    (27 entries)
        -----    ----      --------
  0)     2880      48   lock_timer_base+0x2b/0x4f
  1)     2832      80   __mod_timer+0x33/0xe0
  2)     2752      16   __ide_set_handler+0x63/0x65

To have it be:

        Depth    Size   Location    (27 entries)
        -----    ----   --------
  0)     2880      48   lock_timer_base+0x2b/0x4f
  1)     2832      80   __mod_timer+0x33/0xe0
  2)     2752      16   __ide_set_handler+0x63/0x65

Requested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-13 00:00:58 -04:00
Ingo Molnar
f6411fe7e0 Merge branches 'sched/clock', 'sched/urgent' and 'linus' into sched/core 2009-03-13 04:50:44 +01:00
Steven Rostedt
5cc9854888 ring-buffer: document reader page design
In a private email conversation I explained how the ring buffer
page worked by using silly ASCII art. Ingo suggested that I add
that to the comments of the code.

Here it is.

Requested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 22:24:17 -04:00
Steven Rostedt
f28e55765e tracing: show event name in trace for TRACE_EVENT created events
Unlike TRACE_FORMAT() macros, the TRACE_EVENT() macros do not show
the event name in the trace file. Knowing the event type in the trace
output is very useful.

Instead of:

   task swapper:0 [140] ==> ntpd:3308 [120]

We now have:

   sched_switch: task swapper:0 [140] ==> ntpd:3308 [120]

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 22:00:19 -04:00
KOSAKI Motohiro
889a6c3672 tracing: Don't use tracing_record_cmdline() in workqueue tracer fix
commit c3ffc7a40b
"Don't use tracing_record_cmdline() in workqueue tracer"
has a race window.

find_task_by_vpid() requires task_list_lock().

LKML-Reference: <20090313090042.43CD.A69D9226@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:23:47 -04:00
Jason Baron
39842323ce tracing: tracepoints for softirq entry/exit - tracepoints
Introduce softirq entry/exit tracepoints. These are useful for
augmenting existing tracers, and to figure out softirq frequencies and
timings.

[
  s/irq_softirq_/softirq_/ for trace point names and
  Fixed printf format in TRACE_FORMAT macro
   - Steven Rostedt
]

LKML-Reference: <20090312183603.GC3352@redhat.com>
Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:20:58 -04:00
Jason Baron
5d592b44b2 tracing: tracepoints for softirq entry/exit - add softirq-to-name array
Create a 'softirq_to_name' array, which is indexed by softirq #, so
that we can easily convert between the softirq index # and its name, in
order to get more meaningful output messages.

LKML-Reference: <20090312183336.GB3352@redhat.com>
Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:15:02 -04:00
Steven Rostedt
e447e1df2e tracing: explain why stack tracer is empty
If the stack tracing is disabled (by default) the stack_trace file
will only contain the header:

 # cat /debug/tracing/stack_trace
        Depth    Size      Location    (0 entries)
        -----    ----      --------

This can be frustrating to a developer that does not realize that the
stack tracer is disabled. This patch adds the following text:

  # cat /debug/tracing/stack_trace
        Depth    Size      Location    (0 entries)
        -----    ----      --------
 #
 #  Stack tracer disabled
 #
 # To enable the stack tracer, either add 'stacktrace' to the
 # kernel command line
 # or 'echo 1 > /proc/sys/kernel/stack_tracer_enabled'
 #

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:15:01 -04:00
Steven Rostedt
2da03ecee6 tracing: fix stack tracer header
The stack tracer use to look like this:

 # cat /debug/tracing/stack_trace
         Depth  Size      Location    (57 entries)
         -----  ----      --------
  0)     5088      16   mempool_alloc_slab+0x16/0x18
  1)     5072     144   mempool_alloc+0x4d/0xfe
  2)     4928      16   scsi_sg_alloc+0x48/0x4a [scsi_mod]

Now it looks like this:

 # cat /debug/tracing/stack_trace

        Depth    Size      Location    (57 entries)
        -----    ----      --------
  0)     5088      16   mempool_alloc_slab+0x16/0x18
  1)     5072     144   mempool_alloc+0x4d/0xfe
  2)     4928      16   scsi_sg_alloc+0x48/0x4a [scsi_mod]

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:15:01 -04:00
Steven Rostedt
7975a2be16 tracing: export trace formats to user space
The binary printk saves a pointer to the format string in the ring buffer.
On output, the format is processed. But if the user is reading the
ring buffer through a binary interface, the pointer is meaningless.

This patch creates a file called printk_formats that maps the pointers
to the formats.

 # cat /debug/tracing/printk_formats
0xffffffff80713d40 : "irq_handler_entry: irq=%d handler=%s\n"
0xffffffff80713d48 : "lock_acquire: %s%s%s\n"
0xffffffff80713d50 : "lock_release: %s\n"

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:15:01 -04:00
Steven Rostedt
e9fb2b6d58 tracing: have event_trace_printk use static tracer
Impact: speed up on event tracing

The event_trace_printk is currently a wrapper function that calls
trace_vprintk. Because it uses a variable for the fmt it misses out
on the optimization of using the binary printk.

This patch makes event_trace_printk into a macro wrapper to use the
fmt as the same as the trace_printks.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:15:00 -04:00
Steven Rostedt
828275574e tracing: make bprint event use the proper event id
The bprint record is using TRACE_PRINT when it should be TRACE_BPRINT.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:15:00 -04:00
Frederic Weisbecker
48ead02030 tracing/core: bring back raw trace_printk for dynamic formats strings
Impact: fix callsites with dynamic format strings

Since its new binary implementation, trace_printk() internally uses static
containers for the format strings on each callsites. But the value is
assigned once at build time, which means that it can't take dynamic
formats.

So this patch unearthes the raw trace_printk implementation for the callers
that will need trace_printk to be able to carry these dynamic format
strings. The trace_printk() macro will use the appropriate implementation
for each callsite. Most of the time however, the binary implementation will
still be used.

The other impact of this patch is that mmiotrace_printk() will use the old
implementation because it calls the low level trace_vprintk and we can't
guess here whether the format passed in it is dynamic or not.

Some parts of this patch have been written by Steven Rostedt (most notably
the part that chooses the appropriate implementation for each callsites).

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:15:00 -04:00
Steven Rostedt
db526ca329 tracing: show that buffer size is not expanded
Impact: do not confuse user on small trace buffer sizes

When the system boots up, the trace buffer is small to conserve memory.
It is only two pages per online CPU. When the tracer is used, it expands
to the default value.

This can confuse the user if they look at the buffer size and see only
7, but then later they see 1408.

 # cat /debug/tracing/buffer_size_kb
7

 # echo sched_switch > /debug/tracing/current_tracer

 # cat /debug/tracing/buffer_size_kb
1408

This patch tries to help remove this confustion by showing that the
buffer has not been expanded.

 # cat /debug/tracing/buffer_size_kb
7 (expanded: 1408)

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:14:59 -04:00
Steven Rostedt
8aabee573d ring-buffer: remove unneeded get_online_cpus
Impact: speed up and remove possible races

The get_online_cpus was added to the ring buffer because the original
design would free the ring buffer on a CPU that was being taken
off line. The final design kept the ring buffer around even when the
CPU was taken off line. This is to allow a user to still read the
information on that ring buffer.

Most of the get_online_cpus are no longer needed since the ring buffer will
not disappear from the use cases.

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:14:59 -04:00
Steven Rostedt
59222efe2d ring-buffer: use CONFIG_HOTPLUG_CPU not CONFIG_HOTPLUG
The hotplug code in the ring buffers is for use with CPU hotplug,
not generic hotplug.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:14:59 -04:00
Steven Rostedt
1027fcb206 tracing: protect ring_buffer_expanded with trace_types_lock
Impact: prevent races with ring_buffer_expanded

This patch places the expanding of the tracing buffer under the
protection of the trace_types_lock mutex. It is highly unlikely
that there would be any contention, but better safe than sorry.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:14:58 -04:00
Steven Rostedt
a123c52b46 tracing: fix comments about trace buffer resizing
Impact: cleanup

Some of the comments about the trace buffer resizing is gobbledygook.
And I wonder why people question if I'm a native English speaker.

This patch makes the comments make a bit more sense.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-12 21:14:58 -04:00
Ingo Molnar
25d500067d Merge branch 'linus' into core/ipi 2009-03-13 02:14:25 +01:00
Steven Rostedt
51b643b404 Merge branch 'tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into trace/tip/tracing/ftrace-merge 2009-03-12 21:12:46 -04:00
Ingo Molnar
480c93df5b Merge branch 'core/locking' into tracing/ftrace 2009-03-13 01:33:21 +01:00
Ingo Molnar
d820ac4c2f locking: rename trace_softirq_[enter|exit] => lockdep_softirq_[enter|exit]
Impact: cleanup

The naming clashes with upcoming softirq tracepoints, so rename the
APIs to lockdep_*().

Requested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 01:32:36 +01:00
Ingo Molnar
3c1f67d60e Merge branch 'linus' into core/locking 2009-03-13 01:29:17 +01:00
Darren Hart
f061d35150 futex: remove the pointer math from double_unlock_hb
Impact: simplify code

I mistakenly included the pointer value ordering in the
double_unlock_hb() in my previous patch. It's only necessary
in the double_lock_hb() function. This patch removes it.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <20090312221118.11146.68610.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 01:15:46 +01:00
Magnus Damm
eb53b4e8fe irq: export remove_irq() and setup_irq() symbols
Export the setup_irq() and remove_irq() symbols.

I'd like to export these functions since I have timer
code that needs to use setup_irq() early on (too early
for request_irq()), and the same code can also be
compiled as a module.

Signed-off-by: Magnus Damm <damm@igel.co.jp>
LKML-Reference: <20090312120559.2926.82371.sendpatchset@rx1.opensource.se>
[ changed to _GPL as these are special APIs deep inside the irq layer. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12 13:16:33 +01:00
Magnus Damm
cbf94f0682 irq: match remove_irq() args with setup_irq()
Modify remove_irq() to match setup_irq().

Signed-off-by: Magnus Damm <damm@igel.co.jp>
LKML-Reference: <20090312120551.2926.43942.sendpatchset@rx1.opensource.se>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12 13:16:33 +01:00
Magnus Damm
f21cfb258d irq: add remove_irq() for freeing of setup_irq() irqs
Impact: add new API

This patch adds a remove_irq() function for releasing
interrupts requested with setup_irq().

Without this patch we have no way of releasing such
interrupts since free_irq() today tries to kfree()
the irqaction passed with setup_irq().

Signed-off-by: Magnus Damm <damm@igel.co.jp>
LKML-Reference: <20090312120542.2926.56609.sendpatchset@rx1.opensource.se>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12 13:16:32 +01:00
Ingo Molnar
f8cb22cbb8 Merge branch 'linus' into irq/genirq 2009-03-12 13:16:18 +01:00
Darren Hart
e4dc5b7a36 futex: clean up fault logic
Impact: cleanup

Older versions of the futex code held the mmap_sem which had to
be dropped in order to call get_user(), so a two-pronged fault
handling mechanism was employed to handle faults of the atomic
operations.  The mmap_sem is no longer held, so get_user()
should be adequate.  This patch greatly simplifies the logic and
improves legibility.

Build and boot tested on a 4 way Intel x86_64 workstation.
Passes basic pthread_mutex and PI tests out of
ltp/testcases/realtime.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <20090312075612.9856.48612.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12 11:20:57 +01:00
Darren Hart
e8f6386c01 futex: unlock before returning -EFAULT
Impact: rt-mutex failure case fix

futex_lock_pi can potentially return -EFAULT with the rt_mutex
held.  This seems like the wrong thing to do as userspace should
assume -EFAULT means the lock was not taken.  Even if it could
figure this out, we'd be leaving the pi_state->owner in an
inconsistent state.  This patch unlocks the rt_mutex prior to
returning -EFAULT to userspace.

Build and boot tested on a 4 way Intel x86_64 workstation.
Passes basic pthread_mutex and PI tests out of
ltp/testcases/realtime.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <20090312075606.9856.88729.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12 11:20:57 +01:00
Darren Hart
16f4993f4e futex: use current->time_slack_ns for rt tasks too
RT tasks should set their timer slack to 0 on their own.  This
patch removes the 'if (rt_task()) slack = 0;' block in
futex_wait.

Build and boot tested on a 4 way Intel x86_64 workstation.
Passes basic pthread_mutex and PI tests out of
ltp/testcases/realtime.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Arjan van de Ven <arjan@linux.intel.com>
LKML-Reference: <20090312075559.9856.28822.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12 11:20:57 +01:00
Darren Hart
5eb3dc62fc futex: add double_unlock_hb()
Impact: cleanup

The futex code uses double_lock_hb() which locks the hb->lock's
in pointer value order.  There is no parallel unlock routine,
and the code unlocks them in name order, ignoring pointer value.

This patch adds double_unlock_hb() to refactor the duplicated
code segments.

Build and boot tested on a 4 way Intel x86_64 workstation.
Passes basic pthread_mutex and PI tests out of
ltp/testcases/realtime.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <20090312075552.9856.48021.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12 11:20:56 +01:00
Darren Hart
de87fcc124 futex: additional (get|put)_futex_key() fixes
Impact: fix races

futex_requeue and futex_lock_pi still had some bad
(get|put)_futex_key() usage. This patch adds the missing
put_futex_keys() and corrects a goto in futex_lock_pi() to avoid
a double get.

Build and boot tested on a 4 way Intel x86_64 workstation.
Passes basic pthread_mutex and PI tests out of
ltp/testcases/realtime.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <20090312075545.9856.75152.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12 11:20:56 +01:00
Darren Hart
b2d0994b13 futex: update futex commentary
Impact: cleanup

The futex_hash_bucket can be a bit confusing when first looking
at the code as it is a shared queue (and futex_q isn't a queue
at all, but rather an element on the queue).

The mmap_sem is no longer held outside of the
futex_handle_fault() routine, yet numerous comments refer to it.
The fshared argument is no an integer.  I left some of these
comments along as they are simply removed in future patches.

Some of the commentary refering to futexes by virtual page
mappings was not very clear, and completely accurate (as for
shared futexes both the page and the offset are used to
determine the key).  For the purposes of the function
description, just referring to "the futex" seems sufficient.

With hashed futexes we now access the page after the hash-bucket
is locked, and not only after it is enqueued.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <20090312075537.9856.29954.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-12 11:20:55 +01:00
Steven Rostedt
554f786e28 ring-buffer: only allocate buffers for online cpus
Impact: save on memory

Currently, a ring buffer was allocated for each "possible_cpus". On
some systems, this is the same as NR_CPUS. Thus, if a system defined
NR_CPUS = 64 but it only had 1 CPU, we could have possibly 63 useless
ring buffers taking up space. With a default buffer of 3 megs, this
could be quite drastic.

This patch changes the ring buffer code to only allocate ring buffers
for online CPUs.  If a CPU goes off line, we do not free the buffer.
This is because the user may still have trace data in that buffer
that they would like to look at.

Perhaps in the future we could add code to delete a ring buffer if
the CPU is offline and the ring buffer becomes empty.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-11 22:15:27 -04:00
Steven Rostedt
9aba60fe6e tracing: fix trace_wait to know to wait on all cpus or just one
Impact: fix to task live locking on reading trace_pipe on one CPU

The same code is used for both trace_pipe (all CPUS) and the per_cpu
trace_pipe file. When there is no data to read, it will check for
signals and wait on the trace wait queue.

The problem happens with the per_cpu wait. The trace_wait code checks
all CPUs. Thus, if there's data in another CPU buffer, then it will
exit the wait, without checking for signals or waiting on the wait queue.

It would then try to read the empty buffer, and since that will just
return nothing, then it will try to wait again. Unfortunately, that will
again fail due to there still being data in the other buffers. This
ends up with a live lock for the task.

This patch fixes the trace_wait to be aware that the iterator may only
be waiting on a single buffer.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-11 22:15:25 -04:00
Steven Rostedt
1852fcce18 tracing: expand the ring buffers when an event is activated
To save memory, the tracer ring buffers are set to a minimum.
The activating of a trace expands the ring buffer size. This patch
adds this expanding, when an event is activated.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-11 22:15:24 -04:00
Steven Rostedt
73c5162aa3 tracing: keep ring buffer to minimum size till used
Impact: less memory impact on systems not using tracer

When the kernel boots up that has tracing configured, it allocates
the default size of the ring buffer. This currently happens to be
1.4Megs per possible CPU. This is quite a bit of wasted memory if
the system is never using the tracer.

The current solution is to keep the ring buffers to a minimum size
until the user uses them. Once a tracer is piped into the current_tracer
the ring buffer will be expanded to the default size. If the user
changes the size of the ring buffer, it will take the size given
by the user immediately.

If the user adds a "ftrace=" to the kernel command line, then the ring
buffers will be set to the default size on initialization.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-11 22:15:22 -04:00
Ingo Molnar
aecfcde920 Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-03-11 20:47:23 +01:00
Mike Galbraith
df1c99d416 sched: add avg_overlap decay
Impact: more precise avg_overlap metric - better load-balancing

avg_overlap is used to measure the runtime overlap of the waker and
wakee.

However, when a process changes behaviour, eg a pipe becomes
un-congested and we don't need to go to sleep after a wakeup
for a while, the avg_overlap value grows stale.

When running we use the avg runtime between preemption as a
measure for avg_overlap since the amount of runtime can be
correlated to cache footprint.

The longer we run, the less likely we'll be wanting to be
migrated to another CPU.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1236709131.25234.576.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-11 11:31:50 +01:00
Ingo Molnar
78b020d035 Merge branches 'x86/cleanups', 'x86/kexec', 'x86/mce2' and 'linus' into x86/core 2009-03-11 10:49:15 +01:00
Benjamin Herrenschmidt
e14eee56c2 Merge commit 'origin/master' into next 2009-03-11 17:10:07 +11:00
Dhaval Giani
be50b8342d kernel/user.c: fix a memory leak when freeing up non-init usernamespaces users
We were returning early in the sysfs directory cleanup function if the
user belonged to a non init usernamespace.  Due to this a lot of the
cleanup was not done and we were left with a leak.  Fix the leak.

Reported-by: Serge Hallyn <serue@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Tested-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-10 15:55:11 -07:00
Ingo Molnar
e2b8b28085 Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-03-10 22:55:31 +01:00
Ingo Molnar
4dd163a051 Merge branches 'tracing/ftrace', 'tracing/textedit' and 'linus' into tracing/core 2009-03-10 22:54:23 +01:00
Steven Rostedt
80370cb758 tracing: use raw spinlocks for trace_vprintk
Impact: prevent locking up by lockdep tracer

The lockdep tracer uses trace_vprintk and thus trace_vprintk can not
call back into lockdep without locking up.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-10 17:16:35 -04:00
Peter Zijlstra
6cc3c6e12b trace_clock: fix preemption bug
Using the function_graph tracer in recent kernels generates a spew of
preemption BUGs. Fix this by not requiring trace_clock_local() users
to disable preemption themselves.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-10 20:03:01 +01:00
Steven Rostedt
ef18012b24 tracing: remove funky whitespace in the trace code
Impact: clean up

There existed a lot of <space><tab>'s in the tracing code. This
patch removes them.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-10 14:13:14 -04:00
Steven Rostedt
0e3d0f0566 tracing: update comments to match event code macros
Impact: clean up / comments

The comments that described the ftrace macros to manipulate the
TRACE_EVENT and TRACE_FORMAT macros no longer match the code.
This patch updates them.

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-10 13:12:58 -04:00
Steven Rostedt
30a8fecc2d tracing: flip the TP_printk and TP_fast_assign in the TRACE_EVENT macro
Impact: clean up

In trying to stay consistant with the C style format in the TRACE_EVENT
macro, it makes more sense to do the printk after the assigning of
the variables.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-10 12:41:38 -04:00
Steven Rostedt
2314c4ae14 tracing: add back the available_events file
The event directory files type and available_types were no longer
needed with the new TRACE_EVENT_FORMAT macros, they were deleted.
But by accident the available_events file was also removed.
This patch brings it back.

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-10 12:04:02 -04:00
Peter Zijlstra
57310a98a3 sched: optimize ttwu vs group scheduling
Impact: micro-optimization

We can avoid the sched domain walk on try_to_wake_up() when we know
there are no groups.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1236603381.8389.455.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-10 16:43:36 +01:00
Ingo Molnar
8c54436ae9 Merge branches 'sched/cleanups' and 'linus' into sched/core 2009-03-10 16:34:43 +01:00
Steven Rostedt
40e26815fa tracing: do not allow modifying the ftrace events via the event files
Impact: fix to prevent crash on calling NULL function pointer

The ftrace internal records have their format exported via the event
system under the ftrace subsystem. These are only for exporting the
format to allow binary readers to be able to parse them in a binary
output.

The ftrace subsystem events can only be enabled via the ftrace tracers
and do not have a registering function. The event files expect the
event record to have registering function and will call it directly.
Passing in a ftrace subsystem event will cause the kernel to crash
because it will execute a NULL pointer.

This patch prevents the ftrace subsystem from being viewable to the
event enabling files.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-10 11:32:40 -04:00
Steven Rostedt
ce8eb2bf05 tracing: fix printk format specifier
Impact: clean up

The offsetof and sizeof are of type size_t, and instead of typecasting
them to unsigned int for printk formatting, one could just use %zu.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-10 10:14:35 -04:00
Américo Wang
bbb76b552a ptrace: remove a useless goto
Impact: cleanup

Obviously, this goto is useless. Remove it.

Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Roland McGrath <roland@redhat.com>
LKML-Reference: <20090310093447.GC3179@hack>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-10 13:39:29 +01:00
KOSAKI Motohiro
bbcd306359 tracing: Don't assume possible cpu list have continuous numbers
"for (++cpu ; cpu < num_possible_cpus(); cpu++)" statement assumes
possible cpus have continuous number - but that's a wrong assumption.

Insted, cpumask_next() should be used.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20090310104437.A480.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-10 10:20:30 +01:00
Ingo Molnar
8293dd6f86 Merge branch 'x86/core' into tracing/ftrace
Semantic merge:

  kernel/trace/trace_functions_graph.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-10 10:17:48 +01:00
Ingo Molnar
9a1043d19c Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-03-10 09:57:16 +01:00
Ingo Molnar
12e87e36e0 Merge branches 'tracing/doc', 'tracing/ftrace', 'tracing/printk' and 'linus' into tracing/core 2009-03-10 09:56:25 +01:00
Ingo Molnar
467c88fee5 Merge branches 'x86/apic', 'x86/asm', 'x86/fixmap', 'x86/memtest', 'x86/mm', 'x86/urgent', 'linus' and 'core/percpu' into x86/core 2009-03-10 09:26:38 +01:00
Steven Rostedt
157587d7ac tracing: remove obsolete TRACE_EVENT_FORMAT macro
Impact: clean up

The TRACE_EVENT_FORMAT macro is no longer used by trace points
and only the DECLARE_TRACE, TRACE_FORMAT or TRACE_EVENT macros should
be used by them. Although the TRACE_EVENT_FORMAT macro is still used
by the internal tracing utility, it should not be used in core
kernel code.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-10 00:35:12 -04:00
Steven Rostedt
da4d03020c tracing: new format for specialized trace points
Impact: clean up and enhancement

The TRACE_EVENT_FORMAT macro looks quite ugly and is limited in its
ability to save data as well as to print the record out. Working with
Ingo Molnar, we came up with a new format that is much more pleasing to
the eye of C developers. This new macro is more C style than the old
macro, and is more obvious to what it does.

Here's the example. The only updated macro in this patch is the
sched_switch trace point.

The old method looked like this:

 TRACE_EVENT_FORMAT(sched_switch,
        TP_PROTO(struct rq *rq, struct task_struct *prev,
                struct task_struct *next),
        TP_ARGS(rq, prev, next),
        TP_FMT("task %s:%d ==> %s:%d",
              prev->comm, prev->pid, next->comm, next->pid),
        TRACE_STRUCT(
                TRACE_FIELD(pid_t, prev_pid, prev->pid)
                TRACE_FIELD(int, prev_prio, prev->prio)
                TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
                                    next_comm,
                                    TP_CMD(memcpy(TRACE_ENTRY->next_comm,
                                                 next->comm,
                                                 TASK_COMM_LEN)))
                TRACE_FIELD(pid_t, next_pid, next->pid)
                TRACE_FIELD(int, next_prio, next->prio)
        ),
        TP_RAW_FMT("prev %d:%d ==> next %s:%d:%d")
        );

The above method is hard to read and requires two format fields.

The new method:

 /*
  * Tracepoint for task switches, performed by the scheduler:
  *
  * (NOTE: the 'rq' argument is not used by generic trace events,
  *        but used by the latency tracer plugin. )
  */
 TRACE_EVENT(sched_switch,

	TP_PROTO(struct rq *rq, struct task_struct *prev,
		 struct task_struct *next),

	TP_ARGS(rq, prev, next),

	TP_STRUCT__entry(
		__array(	char,	prev_comm,	TASK_COMM_LEN	)
		__field(	pid_t,	prev_pid			)
		__field(	int,	prev_prio			)
		__array(	char,	next_comm,	TASK_COMM_LEN	)
		__field(	pid_t,	next_pid			)
		__field(	int,	next_prio			)
	),

	TP_printk("task %s:%d [%d] ==> %s:%d [%d]",
		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
		__entry->next_comm, __entry->next_pid, __entry->next_prio),

	TP_fast_assign(
		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
		__entry->prev_pid	= prev->pid;
		__entry->prev_prio	= prev->prio;
		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
		__entry->next_pid	= next->pid;
		__entry->next_prio	= next->prio;
	)
 );

This macro is called TRACE_EVENT, it is broken up into 5 parts:

 TP_PROTO:        the proto type of the trace point
 TP_ARGS:         the arguments of the trace point
 TP_STRUCT_entry: the structure layout of the entry in the ring buffer
 TP_printk:       the printk format
 TP_fast_assign:  the method used to write the entry into the ring buffer

The structure is the definition of how the event will be saved in the
ring buffer. The printk is used by the internal tracing in case of
an oops, and the kernel needs to print out the format of the record
to the console. This the TP_printk gives a means to show the records
in a human readable format. It is also used to print out the data
from the trace file.

The TP_fast_assign is executed directly. It is basically like a C function,
where the __entry is the handle to the record.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-10 00:35:07 -04:00
Steven Rostedt
9cc26a261d tracing: use generic __stringify
Impact: clean up

This removes the custom made STR(x) macros in the tracer and uses
the generic __stringify macro instead.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-10 00:35:05 -04:00
Steven Rostedt
2939b0469d tracing: replace TP<var> with TP_<var>
Impact: clean up

The macros TPPROTO, TPARGS, TPFMT, TPRAWFMT, and TPCMD all look a bit
ugly. This patch adds an underscore to their names.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-10 00:35:04 -04:00
Steven Rostedt
156b5f172a tracing: typecast sizeof and offsetof to unsigned int
Impact: fix compiler warnings

On x86_64 sizeof and offsetof are treated as long, where as on x86_32
they are int. This patch typecasts them to unsigned int to avoid
one arch giving warnings while the other does not.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-10 00:34:03 -04:00
Oleg Nesterov
2d5516cbb9 copy_process: fix CLONE_PARENT && parent_exec_id interaction
CLONE_PARENT can fool the ->self_exec_id/parent_exec_id logic. If we
re-use the old parent, we must also re-use ->parent_exec_id to make
sure exit_notify() sees the right ->xxx_exec_id's when the CLONE_PARENT'ed
task exits.

Also, move down the "p->parent_exec_id = p->self_exec_id" thing, to place
two different cases together.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-09 13:23:25 -07:00
Heiko Carstens
6d5b5acca9 Fix fixpoint divide exception in acct_update_integrals
Frans Pop reported the crash below when running an s390 kernel under Hercules:

  Kernel BUG at 000738b4  verbose debug info unavailable!
  fixpoint divide exception: 0009  #1! SMP
  Modules linked in: nfs lockd nfs_acl sunrpc ctcm fsm tape_34xx
     cu3088 tape ccwgroup tape_class ext3 jbd mbcache dm_mirror dm_log dm_snapshot
     dm_mod dasd_eckd_mod dasd_mod
  CPU: 0 Not tainted 2.6.27.19 #13
  Process awk (pid: 2069, task: 0f9ed9b8, ksp: 0f4f7d18)
  Krnl PSW : 070c1000 800738b4 (acct_update_integrals+0x4c/0x118)
             R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0
  Krnl GPRS: 00000000 000007d0 7fffffff fffff830
             00000000 ffffffff 00000002 0f9ed9b8
             00000000 00008ca0 00000000 0f9ed9b8
             0f9edda4 8007386e 0f4f7ec8 0f4f7e98
  Krnl Code: 800738aa: a71807d0         lhi     %r1,2000
             800738ae: 8c200001         srdl    %r2,1
             800738b2: 1d21             dr      %r2,%r1
            >800738b4: 5810d10e         l       %r1,270(%r13)
             800738b8: 1823             lr      %r2,%r3
             800738ba: 4130f060         la      %r3,96(%r15)
             800738be: 0de1             basr    %r14,%r1
             800738c0: 5800f060         l       %r0,96(%r15)
  Call Trace:
  ( <000000000004fdea>! blocking_notifier_call_chain+0x1e/0x2c)
    <0000000000038502>! do_exit+0x106/0x7c0
    <0000000000038c36>! do_group_exit+0x7a/0xb4
    <0000000000038c8e>! SyS_exit_group+0x1e/0x30
    <0000000000021c28>! sysc_do_restart+0x12/0x16
    <0000000077e7e924>! 0x77e7e924

Reason for this is that cpu time accounting usually only happens from
interrupt context, but acct_update_integrals gets also called from
process context with interrupts enabled.

So in acct_update_integrals we may end up with the following scenario:

Between reading tsk->stime/tsk->utime and tsk->acct_timexpd an interrupt
happens which updates accouting values.  This causes acct_timexpd to be
greater than the former stime + utime.  The subsequent calculation of

	dtime = cputime_sub(time, tsk->acct_timexpd);

will be negative and the division performed by

	cputime_to_jiffies(dtime)

will generate an exception since the result won't fit into a 32 bit
register.

In order to fix this just always disable interrupts while accessing any
of the accounting values.

Reported by: Frans Pop <elendil@planet.nl>
Tested by: Frans Pop <elendil@planet.nl>
Cc: stable@kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-09 08:13:35 -07:00
KOSAKI Motohiro
c3ffc7a40b tracing: Don't use tracing_record_cmdline() in workqueue tracer
Impact: improve workqueue tracer output

Currently, /sys/kernel/debug/tracing/trace_stat/workqueues can display
wrong and strange thread names.

Why?

Currently, ftrace has tracing_record_cmdline()/trace_find_cmdline()
convenience function that implements a task->comm string cache.

This can avoid unnecessary memcpy overhead and the workqueue tracer
uses it.

However, in general, any trace statistics feature shouldn't use
tracing_record_cmdline() because trace statistics can display
very old process. Then comm cache can return wrong string because
recent process overrides the cache.

Fortunately, workqueue trace guarantees that displayed processes
are live. Thus we can search comm string from PID at display time.

<before>

% cat workqueues
 # CPU  INSERTED  EXECUTED   NAME
 # |      |         |          |

   7 431913     431913       kondemand/7
   7      0          0       tail
   7     21         21       git
   7      0          0       ls
   7      9          9       cat
   7 832632     832632       unix_chkpwd
   7 236292     236292       ls

Note: tail, git, ls, cat unix_chkpwd are obiously not workqueue thread.

<after>

% cat workqueues
 # CPU  INSERTED  EXECUTED   NAME
 # |      |         |          |

   7    510        510       kondemand/7
   7      0          0       kmpathd/7
   7     15         15       ata/7
   7      0          0       aio/7
   7     11         11       kblockd/7
   7   1063       1063       work_on_cpu/7
   7    167        167       events/7

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-09 10:26:13 +01:00
KOSAKI Motohiro
888b55dc31 ftrace: tracing header should put '#' at the beginning of a line
In a recent discussion, Andrew Morton pointed out that tracing header
should put '#' at the beginning of a line.

Then, we can easily filtered the header by following grep usage:

  cat trace | grep -v '^#'

Wakeup trace also has the same header problem.

Comparison of headers displayed:

before this patch:

 # tracer: wakeup
 #
 wakeup latency trace v1.1.5 on 2.6.29-rc7-tip-tip
 --------------------------------------------------------------------
  latency: 19059 us, #21277/21277, CPU#1 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
     -----------------
     | task: kondemand/1-1644 (uid:0 nice:-5 policy:0 rt_prio:0)
     -----------------

 #                  _------=> CPU#
 #                 / _-----=> irqs-off
 #                | / _----=> need-resched
 #                || / _---=> hardirq/softirq
 #                ||| / _--=> preempt-depth
 #                |||| /
 #                |||||     delay
 #  cmd     pid   ||||| time  |   caller
 #     \   /      |||||   \   |   /
 irqbalan-1887    1d.s.    0us :   1887:120:R   + [001]  1644:115:S kondemand/1
 irqbalan-1887    1d.s.    1us : default_wake_function <-autoremove_wake_function
 irqbalan-1887    1d.s.    2us : check_preempt_wakeup <-try_to_wake_up

after this patch:

 # tracer: wakeup
 #
 # wakeup latency trace v1.1.5 on 2.6.29-rc7-tip-tip
 # --------------------------------------------------------------------
 # latency: 529 us, #530/530, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
 #    -----------------
 #    | task: kondemand/0-1641 (uid:0 nice:-5 policy:0 rt_prio:0)
 #    -----------------
 #
 #                  _------=> CPU#
 #                 / _-----=> irqs-off
 #                | / _----=> need-resched
 #                || / _---=> hardirq/softirq
 #                ||| / _--=> preempt-depth
 #                |||| /
 #                |||||     delay
 #  cmd     pid   ||||| time  |   caller
 #     \   /      |||||   \   |   /
     sshd-2496    0d.s.    0us :   2496:120:R   + [000]  1641:115:S kondemand/0
     sshd-2496    0d.s.    1us : default_wake_function <-autoremove_wake_function
     sshd-2496    0d.s.    1us : check_preempt_wakeup <-try_to_wake_up

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20090308124421.23C3.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-08 16:57:22 +01:00
Ingo Molnar
dba58e39ce Merge branches 'tracing/doc', 'tracing/ftrace', 'tracing/printk' and 'tracing/textedit' into tracing/core 2009-03-08 16:48:51 +01:00
Ingo Molnar
9de36825b3 tracing: trace_bprintk() cleanups
Impact: cleanup

Remove a few leftovers and clean up the code a bit.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1236356510-8381-5-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-06 17:59:12 +01:00
Frederic Weisbecker
769b0441f4 tracing/core: drop the old trace_printk() implementation in favour of trace_bprintk()
Impact: faster and lighter tracing

Now that we have trace_bprintk() which is faster and consume lesser
memory than trace_printk() and has the same purpose, we can now drop
the old implementation in favour of the binary one from trace_bprintk(),
which means we move all the implementation of trace_bprintk() to
trace_printk(), so the Api doesn't change except that we must now use
trace_seq_bprintk() to print the TRACE_PRINT entries.

Some changes result of this:

- Previously, trace_bprintk depended of a single tracer and couldn't
  work without. This tracer has been dropped and the whole implementation
  of trace_printk() (like the module formats management) is now integrated
  in the tracing core (comes with CONFIG_TRACING), though we keep the file
  trace_printk (previously trace_bprintk.c) where we can find the module
  management. Thus we don't overflow trace.c

- changes some parts to use trace_seq_bprintk() to print TRACE_PRINT entries.

- change a bit trace_printk/trace_vprintk macros to support non-builtin formats
  constants, and fix 'const' qualifiers warnings. But this is all transparent for
  developers.

- etc...

V2:

- Rebase against last changes
- Fix mispell on the changelog

V3:

- Rebase against last changes (moving trace_printk() to kernel.h)

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1236356510-8381-5-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-06 17:59:12 +01:00