Commit Graph

10648 Commits

Author SHA1 Message Date
Linus Torvalds
a0fe3cc5d3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (40 commits)
  Input: psmouse - small formatting changes to better follow coding style
  Input: synaptics - set dimensions as reported by firmware
  Input: elantech - relax signature checks
  Input: elantech - enforce common prefix on messages
  Input: wistron_btns - switch to using kmemdup()
  Input: usbtouchscreen - switch to using kmemdup()
  Input: do not force selecting i8042 on Moorestown
  Input: Documentation/sysrq.txt - update KEY_SYSRQ info
  Input: 88pm860x_onkey - remove invalid irq number assignment
  Input: i8042 - add a PNP entry to the aux device list
  Input: i8042 - add some extra PNP keyboard types
  Input: wm9712 - fix wm97xx_set_gpio() logic
  Input: add keypad driver for keys interfaced to TCA6416
  Input: remove obsolete {corgi,spitz,tosa}kbd.c
  Input: kbtab - do not advertise unsupported events
  Input: kbtab - simplify kbtab_disconnect()
  Input: kbtab - fix incorrect size parameter in usb_buffer_free
  Input: acecad - don't advertise mouse events
  Input: acecad - fix some formatting issues
  Input: acecad - simplify usb_acecad_disconnect()
  ...

Trivial conflict in Documentation/feature-removal-schedule.txt
2010-05-20 10:33:06 -07:00
Linus Torvalds
f39d01be4c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
  vlynq: make whole Kconfig-menu dependant on architecture
  add descriptive comment for TIF_MEMDIE task flag declaration.
  EEPROM: max6875: Header file cleanup
  EEPROM: 93cx6: Header file cleanup
  EEPROM: Header file cleanup
  agp: use NULL instead of 0 when pointer is needed
  rtc-v3020: make bitfield unsigned
  PCI: make bitfield unsigned
  jbd2: use NULL instead of 0 when pointer is needed
  cciss: fix shadows sparse warning
  doc: inode uses a mutex instead of a semaphore.
  uml: i386: Avoid redefinition of NR_syscalls
  fix "seperate" typos in comments
  cocbalt_lcdfb: correct sections
  doc: Change urls for sparse
  Powerpc: wii: Fix typo in comment
  i2o: cleanup some exit paths
  Documentation/: it's -> its where appropriate
  UML: Fix compiler warning due to missing task_struct declaration
  UML: add kernel.h include to signal.c
  ...
2010-05-20 09:20:59 -07:00
Linus Torvalds
46ee964509 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM: PM QOS update fix
  Freezer / cgroup freezer: Update stale locking comments
  PM / platform_bus: Allow runtime PM by default
  i2c: Fix bus-level power management callbacks
  PM QOS update
  PM / Hibernate: Fix block_io.c printk warning
  PM / Hibernate: Group swap ops
  PM / Hibernate: Move the first_sector out of swsusp_write
  PM / Hibernate: Separate block_io
  PM / Hibernate: Snapshot cleanup
  FS / libfs: Implement simple_write_to_buffer
  PM / Hibernate: document open(/dev/snapshot) side effects
  PM / Runtime: Add sysfs debug files
  PM: Improve device power management document
  PM: Update device power management document
  PM: Allow runtime_suspend methods to call pm_schedule_suspend()
  PM: pm_wakeup - switch to using bool
2010-05-20 09:03:55 -07:00
Linus Torvalds
fa5312d9e8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: change cancel_work_sync() to clear work->data
  workqueue: warn about flush_scheduled_work()
2010-05-20 09:03:02 -07:00
Linus Torvalds
96b5b7f4f2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (61 commits)
  KEYS: Return more accurate error codes
  LSM: Add __init to fixup function.
  TOMOYO: Add pathname grouping support.
  ima: remove ACPI dependency
  TPM: ACPI/PNP dependency removal
  security/selinux/ss: Use kstrdup
  TOMOYO: Use stack memory for pending entry.
  Revert "ima: remove ACPI dependency"
  Revert "TPM: ACPI/PNP dependency removal"
  KEYS: Do preallocation for __key_link()
  TOMOYO: Use mutex_lock_interruptible.
  KEYS: Better handling of errors from construct_alloc_key()
  KEYS: keyring_serialise_link_sem is only needed for keyring->keyring links
  TOMOYO: Use GFP_NOFS rather than GFP_KERNEL.
  ima: remove ACPI dependency
  TPM: ACPI/PNP dependency removal
  selinux: generalize disabling of execmem for plt-in-heap archs
  LSM Audit: rename LSM_AUDIT_NO_AUDIT to LSM_AUDIT_DATA_NONE
  CRED: Holding a spinlock does not imply the holding of RCU read lock
  SMACK: Don't #include Ext2 headers
  ...
2010-05-20 08:55:50 -07:00
Ingo Molnar
dfacc4d6c9 Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core 2010-05-20 14:38:55 +02:00
Frederic Weisbecker
49f135ed02 perf: Comply with new rcu checks API
The software events hlist doesn't fully comply with the new
rcu checks api.

We need to consider three different sides that access the hlist:

- the hlist allocation/release side. This side happens when an
  events is created or released, accesses to the hlist are
  serialized under the cpuctx mutex.

- the events insertion/removal in the hlist. This side is always
  serialized against the above one. The hlist is always present
  during such operations. This side happens when a software event
  is scheduled in/out. The serialization that ensures the software
  event is really attached to the context is made under the
  ctx->lock.

- events triggering. This is the read side, it can happen
  concurrently with any update side.

This patch deals with them one by one and anticipates with the
separate rcu mem space patches in preparation.

This patch fixes various annoying rcu warnings.

Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
2010-05-20 10:40:37 +02:00
Linus Torvalds
164d44fd92 Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  clocksource: Add clocksource_register_hz/khz interface
  posix-cpu-timers: Optimize run_posix_cpu_timers()
  time: Remove xtime_cache
  mqueue: Convert message queue timeout to use hrtimers
  hrtimers: Provide schedule_hrtimeout for CLOCK_REALTIME
  timers: Introduce the concept of timer slack for legacy timers
  ntp: Remove tickadj
  ntp: Make time_adjust static
  time: Add xtime, wall_to_monotonic to feature-removal-schedule
  timer: Try to survive timer callback preempt_count leak
  timer: Split out timer function call
  timer: Print function name for timer callbacks modifying preemption count
  time: Clean up warp_clock()
  cpu-timers: Avoid iterating over all threads in fastpath_timer_check()
  cpu-timers: Change SIGEV_NONE timer implementation
  cpu-timers: Return correct previous timer reload value
  cpu-timers: Cleanup arm_timer()
  cpu-timers: Simplify RLIMIT_CPU handling
2010-05-19 17:11:10 -07:00
Linus Torvalds
6e0b7b2c39 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: Clear CPU mask in affinity_hint when none is provided
  genirq: Add CPU mask affinity hint
  genirq: Remove IRQF_DISABLED from core code
  genirq: Run irq handlers with interrupts disabled
  genirq: Introduce request_any_context_irq()
  genirq: Expose irq_desc->node in proc/irq

Fixed up trivial conflicts in Documentation/feature-removal-schedule.txt
2010-05-19 17:09:40 -07:00
KOSAKI Motohiro
fa9dc265ac cpumask: fix compat getaffinity
Commit a45185d2d "cpumask: convert kernel/compat.c" broke libnuma, which
abuses sched_getaffinity to find out NR_CPUS in order to parse
/sys/devices/system/node/node*/cpumap.

On NUMA systems with less than 32 possibly CPUs, the current
compat_sys_sched_getaffinity now returns '4' instead of the actual
NR_CPUS/8, which makes libnuma bail out when parsing the cpumap.

The libnuma call sched_getaffinity(0, bitmap, 4096) at first.  It mean
the libnuma expect the return value of sched_getaffinity() is either len
argument or NR_CPUS.  But it doesn't expect to return nr_cpu_ids.

Strictly speaking, userland requirement are

1) Glibc assume the return value mean the lengh of initialized
   of mask argument. E.g. if sched_getaffinity(1024) return 128,
   glibc make zero fill rest 896 byte.
2) Libnuma assume the return value can be used to guess NR_CPUS
   in kernel. It assume len-arg<NR_CPUS makes -EINVAL. But
   it try len=4096 at first and 4096 is always bigger than
   NR_CPUS. Then, if we remove strange min_length normalization,
   we never hit -EINVAL case.

sched_getaffinity() already solved this issue.  This patch adapts
compat_sys_sched_getaffinity() to match the non-compat case.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Ken Werner <ken.werner@web.de>
Cc: stable@kernel.org
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-19 11:48:18 -07:00
Linus Torvalds
ba0234ec35 Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (24 commits)
  [S390] drivers/s390/char: Use kmemdup
  [S390] drivers/s390/char: Use kstrdup
  [S390] debug: enable exception-trace debug facility
  [S390] s390_hypfs: Add new attributes
  [S390] qdio: remove API wrappers
  [S390] qdio: set correct bit in dsci
  [S390] qdio: dont convert timestamps to microseconds
  [S390] qdio: remove memset hack
  [S390] qdio: prevent starvation on PCI devices
  [S390] qdio: count number of qdio interrupts
  [S390] user space fault: report fault before calling do_exit
  [S390] topology: expose core identifier
  [S390] dasd: remove uid from devmap
  [S390] dasd: add dynamic pav toleration
  [S390] vdso: add missing vdso_install target
  [S390] vdso: remove redundant check for CONFIG_64BIT
  [S390] avoid default_llseek in s390 drivers
  [S390] vmcp: disallow modular build
  [S390] add breaking event address for user space
  [S390] virtualization aware cpu measurement
  ...
2010-05-19 11:35:30 -07:00
Dmitry Torokhov
8d0bc2b456 Merge commit 'v2.6.34' into next 2010-05-19 10:12:41 -07:00
Don Zickus
26e09c6eee lockup_detector: Convert per_cpu to __get_cpu_var for readability
Just a bunch of conversions as suggested by Frederic W.
__get_cpu_var() provides preemption disabled checks.

Plus it gives more readability as it makes it obvious
we are dealing locally now with these vars.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
LKML-Reference: <1274133966-18415-2-git-send-email-dzickus@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-19 11:32:14 +02:00
Rusty Russell
480b02df3a module: drop the lock while waiting for module to complete initialization.
This fixes "gave up waiting for init of module libcrc32c." which
happened at boot time due to multiple parallel module loads.

The problem was a deadlock: we wait for a module to finish
initializing, but we keep the module_lock mutex so it can't complete.
In particular, this could reasonably happen if a module does a
request_module() in its initialization routine.

So we change use_module() to return an errno rather than a bool, and if
it's -EBUSY we drop the lock and wait in the caller, then reaquire the
lock.

Reported-by: Brandon Philips <brandon@ifup.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Tested-by: Brandon Philips <brandon@ifup.org>
2010-05-19 17:33:39 +09:30
Ben Hutchings
92946bc72f panic: Add taint flag TAINT_FIRMWARE_WORKAROUND ('I')
This taint flag will initially be used when warning about invalid ACPI
DMAR tables.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2010-05-19 08:37:43 +01:00
Ben Hutchings
b2be05273a panic: Allow warnings to set different taint flags
WARN() is used in some places to report firmware or hardware bugs that
are then worked-around.  These bugs do not affect the stability of the
kernel and should not set the flag for TAINT_WARN.  To allow for this,
add WARN_TAINT() and WARN_TAINT_ONCE() macros that take a taint number
as argument.

Architectures that implement warnings using trap instructions instead
of calls to warn_slowpath_*() now implement __WARN_TAINT(taint)
instead of __WARN().

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Helge Deller <deller@gmx.de>
Tested-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2010-05-19 08:36:48 +01:00
Tony Breeds
fd6be105b8 mutex: Fix optimistic spinning vs. BKL
Currently, we can hit a nasty case with optimistic
spinning on mutexes:

    CPU A tries to take a mutex, while holding the BKL

    CPU B tried to take the BLK while holding the mutex

This looks like a AB-BA scenario but in practice, is
allowed and happens due to the auto-release on
schedule() nature of the BKL.

In that case, the optimistic spinning code can get us
into a situation where instead of going to sleep, A
will spin waiting for B who is spinning waiting for
A, and the only way out of that loop is the
need_resched() test in mutex_spin_on_owner().

This patch fixes it by completely disabling spinning
if we own the BKL. This adds one more detail to the
extensive list of reasons why it's a bad idea for
kernel code to be holding the BKL.

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <stable@kernel.org>
LKML-Reference: <20100519054636.GC12389@ozlabs.org>
[ added an unlikely() attribute to the branch ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-19 08:18:44 +02:00
David S. Miller
2ec8c6bb5d Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	include/linux/mod_devicetable.h
	scripts/mod/file2alias.c
2010-05-18 23:01:55 -07:00
Steffen Klassert
3789ae7dcd padata: Use get_online_cpus/put_online_cpus in padata_free
Add get_online_cpus/put_online_cpus to ensure that no cpu goes
offline during the flushing of the padata percpu queues.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-05-19 13:45:35 +10:00
Steffen Klassert
0198ffd135 padata: Add some code comments
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-05-19 13:44:27 +10:00
Steffen Klassert
2b73b07ab8 padata: Flush the padata queues actively
yield was used to wait until all references of the internal control
structure in use are dropped before it is freed. This patch implements
padata_flush_queues which actively flushes the padata percpu queues
in this case.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-05-19 13:43:46 +10:00
Steffen Klassert
d46a5ac7a7 padata: Use a timer to handle remaining objects in the reorder queues
padata_get_next needs to check whether the next object that
need serialization must be parallel processed by the local cpu.
This check was wrong implemented and returned always true,
so the try_again loop in padata_reorder was never taken. This
can lead to object leaks in some rare cases due to a race that
appears with the trylock in padata_reorder. The try_again loop
was not a good idea after all, because a cpu could take that
loop frequently, so we handle this with a timer instead.

This patch adds a timer to handle the race that appears with
the trylock. If cpu1 queues an object to the reorder queue while
cpu2 holds the pd->lock but left the while loop in padata_reorder
already, cpu2 can't care for this object and cpu1 exits because
it can't get the lock. Usually the next cpu that takes the lock
cares for this object too. We need the timer just if this object
was the last one that arrives to the reorder queues. The timer
function sends it out in this case.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-05-19 13:43:14 +10:00
Peter Zijlstra
6d1acfd5c6 perf: Optimize perf_output_*() by avoiding local_xchg()
Since the x86 XCHG ins implies LOCK, avoid the use by
using a sequence count instead.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-18 18:35:49 +02:00
Peter Zijlstra
fa5881514e perf: Optimize the hotpath by converting the perf output buffer to local_t
Since there is now only a single writer, we can use
local_t instead and avoid all these pesky LOCK insn.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-18 18:35:49 +02:00
Peter Zijlstra
ef60777c9a perf: Optimize the perf_output() path by removing IRQ-disables
Since we can now assume there is only a single writer
to each buffer, we can remove per-cpu lock thingy and
use a simply nest-count to the same effect.

This removes the need to disable IRQs.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-18 18:35:48 +02:00
Peter Zijlstra
c7920614ce perf: Disallow mmap() on per-task inherited events
Since we now have working per-task-per-cpu events for
a while, disallow mmap() on per-task inherited
events. Those things were a performance problem
anyway, and doing away with it allows us to optimize
the buffer somewhat by assuming there is only a
single writer.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-18 18:35:48 +02:00
Peter Zijlstra
a19d35c11f perf: Optimize buffer placement by allocating buffers NUMA aware
Ensure cpu bound buffers live on the right NUMA node.

Suggested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1274114880.5605.5236.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-18 18:35:47 +02:00
Stephane Eranian
00d1d0b095 perf: Fix errors path in perf_output_begin()
In case the sampling buffer has no "payload" pages,
nr_pages is 0. The problem is that the error path in
perf_output_begin() skips to a label which assumes
perf_output_lock() has been issued which is not the
case. That triggers a WARN_ON() in
perf_output_unlock().

This patch fixes the problem by skipping
perf_output_unlock() in case data->nr_pages is 0.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4bf13674.014fd80a.6c82.ffffb20c@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-18 18:35:47 +02:00
Peter Zijlstra
4f41c013f5 perf/ftrace: Optimize perf/tracepoint interaction for single events
When we've got but a single event per tracepoint
there is no reason to try and multiplex it so don't.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-18 18:35:46 +02:00
Linus Torvalds
752f114fb8 Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing: Fix "integer as NULL pointer" warning.
  tracing: Fix tracepoint.h DECLARE_TRACE() to allow more than one header
  tracing: Make the documentation clear on trace_event boot option
  ring-buffer: Wrap open-coded WARN_ONCE
  tracing: Convert nop macros to static inlines
  tracing: Fix sleep time function profiling
  tracing: Show sample std dev in function profiling
  tracing: Add documentation for trace commands mod, traceon/traceoff
  ring-buffer: Make benchmark handle missed events
  ring-buffer: Make non-consuming read less expensive with lots of cpus.
  tracing: Add graph output support for irqsoff tracer
  tracing: Have graph flags passed in to ouput functions
  tracing: Add ftrace events for graph tracer
  tracing: Dump either the oops's cpu source or all cpus buffers
  tracing: Fix uninitialized variable of tracing/trace output
2010-05-18 08:35:04 -07:00
Linus Torvalds
b8ae30ee26 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
  stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback()
  sched, wait: Use wrapper functions
  sched: Remove a stale comment
  ondemand: Make the iowait-is-busy time a sysfs tunable
  ondemand: Solve a big performance issue by counting IOWAIT time as busy
  sched: Intoduce get_cpu_iowait_time_us()
  sched: Eliminate the ts->idle_lastupdate field
  sched: Fold updating of the last_update_time_info into update_ts_time_stats()
  sched: Update the idle statistics in get_cpu_idle_time_us()
  sched: Introduce a function to update the idle statistics
  sched: Add a comment to get_cpu_idle_time_us()
  cpu_stop: add dummy implementation for UP
  sched: Remove rq argument to the tracepoints
  rcu: need barrier() in UP synchronize_sched_expedited()
  sched: correctly place paranioa memory barriers in synchronize_sched_expedited()
  sched: kill paranoia check in synchronize_sched_expedited()
  sched: replace migration_thread with cpu_stop
  stop_machine: reimplement using cpu_stop
  cpu_stop: implement stop_cpu[s]()
  sched: Fix select_idle_sibling() logic in select_task_rq_fair()
  ...
2010-05-18 08:27:54 -07:00
Linus Torvalds
4d7b4ac22f Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (311 commits)
  perf tools: Add mode to build without newt support
  perf symbols: symbol inconsistency message should be done only at verbose=1
  perf tui: Add explicit -lslang option
  perf options: Type check all the remaining OPT_ variants
  perf options: Type check OPT_BOOLEAN and fix the offenders
  perf options: Check v type in OPT_U?INTEGER
  perf options: Introduce OPT_UINTEGER
  perf tui: Add workaround for slang < 2.1.4
  perf record: Fix bug mismatch with -c option definition
  perf options: Introduce OPT_U64
  perf tui: Add help window to show key associations
  perf tui: Make <- exit menus too
  perf newt: Add single key shortcuts for zoom into DSO and threads
  perf newt: Exit browser unconditionally when CTRL+C, q or Q is pressed
  perf newt: Fix the 'A'/'a' shortcut for annotate
  perf newt: Make <- exit the ui_browser
  x86, perf: P4 PMU - fix counters management logic
  perf newt: Make <- zoom out filters
  perf report: Report number of events, not samples
  perf hist: Clarify events_stats fields usage
  ...

Fix up trivial conflicts in kernel/fork.c and tools/perf/builtin-record.c
2010-05-18 08:19:03 -07:00
Linus Torvalds
3aaf51ace5 Merge branch 'oprofile-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'oprofile-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
  oprofile/x86: make AMD IBS hotplug capable
  oprofile/x86: notify cpus only when daemon is running
  oprofile/x86: reordering some functions
  oprofile/x86: stop disabled counters in nmi handler
  oprofile/x86: protect cpu hotplug sections
  oprofile/x86: remove CONFIG_SMP macros
  oprofile/x86: fix uninitialized counter usage during cpu hotplug
  oprofile/x86: remove duplicate IBS capability check
  oprofile/x86: move IBS code
  oprofile/x86: return -EBUSY if counters are already reserved
  oprofile/x86: moving shutdown functions
  oprofile/x86: reserve counter msrs pairwise
  oprofile/x86: rework error handler in nmi_setup()
  oprofile: update file list in MAINTAINERS file
  oprofile: protect from not being in an IRQ context
  oprofile: remove double ring buffering
  ring-buffer: Add lost event count to end of sub buffer
  tracing: Show the lost events in the trace_pipe output
  ring-buffer: Add place holder recording of dropped events
  tracing: Fix compile error in module tracepoints when MODULE_UNLOAD not set
  ...
2010-05-18 08:18:07 -07:00
Linus Torvalds
f262af3d08 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
  rcu: remove all rcu head initializations, except on_stack initializations
  rcu head introduce rcu head init on stack
  Debugobjects transition check
  rcu: fix build bug in RCU_FAST_NO_HZ builds
  rcu: RCU_FAST_NO_HZ must check RCU dyntick state
  rcu: make SRCU usable in modules
  rcu: improve the RCU CPU-stall warning documentation
  rcu: reduce the number of spurious RCU_SOFTIRQ invocations
  rcu: permit discontiguous cpu_possible_mask CPU numbering
  rcu: improve RCU CPU stall-warning messages
  rcu: print boot-time console messages if RCU configs out of ordinary
  rcu: disable CPU stall warnings upon panic
  rcu: enable CPU_STALL_VERBOSE by default
  rcu: slim down rcutiny by removing rcu_scheduler_active and friends
  rcu: refactor RCU's context-switch handling
  rcu: rename rcutiny rcu_ctrlblk to rcu_sched_ctrlblk
  rcu: shrink rcutiny by making synchronize_rcu_bh() be inline
  rcu: fix now-bogus rcu_scheduler_active comments.
  rcu: Fix bogus CONFIG_PROVE_LOCKING in comments to reflect reality.
  rcu: ignore offline CPUs in last non-dyntick-idle CPU check
  ...
2010-05-18 08:17:58 -07:00
Linus Torvalds
1014cfe2fb Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  lockdep: Reduce stack_trace usage
  lockdep: No need to disable preemption in debug atomic ops
  lockdep: Actually _dec_ in debug_atomic_dec
  lockdep: Provide off case for redundant_hardirqs_on increment
  lockdep: Simplify debug atomic ops
  lockdep: Fix redundant_hardirqs_on incremented with irqs enabled
  lockstat: Make lockstat counting per cpu
  i8253: Convert i8253_lock to raw_spinlock
2010-05-18 08:17:35 -07:00
James Bottomley
95bb335c0e [SCSI] Merge scsi-misc-2.6 into scsi-rc-fixes-2.6
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
2010-05-18 10:37:41 -04:00
Steven Rostedt
f0218b3e99 Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into trace/tip/tracing/core-6
Conflicts:
	include/trace/ftrace.h
	kernel/trace/trace_kprobe.c

Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-18 00:35:23 -04:00
James Morris
539c99fd7f Merge branch 'next' into for-linus 2010-05-18 08:57:00 +10:00
Ingo Molnar
9c6f7e43b4 stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback()
This addresses the following compiler warning:

 kernel/stop_machine.c: In function 'cpu_stop_cpu_callback':
 kernel/stop_machine.c:297: warning: unused variable 'work'

Cc: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <tip-3fc1f1e27a5b807791d72e5d992aa33b668a6626@git.kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-18 00:17:44 +02:00
Linus Torvalds
3d2c978e0c Merge branch 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  ptrace: Cleanup useless header
  ptrace: kill BKL in ptrace syscall
2010-05-17 13:48:10 -07:00
Heiko Carstens
ab3c68ee5f [S390] debug: enable exception-trace debug facility
The exception-trace facility on x86 and other architectures prints
traces to dmesg whenever a user space application crashes.
s390 has such a feature since ages however it is called
userprocess_debug and is enabled differently.
This patch makes sure that whenever one of the two procfs files

/proc/sys/kernel/userprocess_debug
/proc/sys/debug/exception-trace

is modified the contents of the second one changes as well.
That way we keep backwards compatibilty but also support the same
interface like other architectures do.
Besides that the output of the traces is improved since it will now
also contain the corresponding filename of the vma (when available)
where the process caused a fault or trap.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2010-05-17 10:00:17 +02:00
Mark Gross
25f3a5a285 PM: PM QOS update fix
This update handles a use case where pm_qos update requests need to
silently fail if the update is being sent to a handle that is NULL.

The problem was that the original pm_qos silently fails when a request
update is passed to a parameter that has not been added to the list yet.
This update restores that behavior.

Signed-off-by: markgross <markgross@thegnar.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-05-17 00:21:03 +02:00
Octavian Purdila
9f977fb7ae sysctl: add proc_do_large_bitmap
The new function can be used to read/write large bitmaps via /proc. A
comma separated range format is used for compact output and input
(e.g. 1,3-4,10-10).

Writing into the file will first reset the bitmap then update it
based on the given input.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-15 23:28:39 -07:00
Amerigo Wang
00b7c3395a sysctl: refactor integer handling proc code
(Based on Octavian's work, and I modified a lot.)

As we are about to add another integer handling proc function a little
bit of cleanup is in order: add a few helper functions to improve code
readability and decrease code duplication.

In the process a bug is also fixed: if the user specifies a number
with more then 20 digits it will be interpreted as two integers
(e.g. 10000...13 will be interpreted as 100.... and 13).

Behavior for EFAULT handling was changed as well. Previous to this
patch, when an EFAULT error occurred in the middle of a write
operation, although some of the elements were set, that was not
acknowledged to the user (by shorting the write and returning the
number of bytes accepted). EFAULT is now treated just like any other
errors by acknowledging the amount of bytes accepted.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-15 23:28:38 -07:00
Don Zickus
cafcd80d21 lockup_detector: Cross arch compile fixes
Combining the softlockup and hardlockup code causes watchdog.c
to build even without the hardlockup detection support.

So if an arch, that has the previous and the new nmi watchdog
implementations cohabiting, wants to know if the generic one
is in use, CONFIG_LOCKUP_DETECTOR is not a reliable check.
We need to use CONFIG_HARDLOCKUP_DETECTOR instead.

Fixes:
	kernel/built-in.o: In function `touch_nmi_watchdog':
	(.text+0x449bc): multiple definition of `touch_nmi_watchdog'
	arch/sparc/kernel/built-in.o:(.text+0x11b28): first defined here

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
LKML-Reference: <20100514151121.GR15159@redhat.com>
[ use CONFIG_HARDLOCKUP_DETECTOR instead of CONFIG_PERF_EVENTS_NMI]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-16 04:25:14 +02:00
Frederic Weisbecker
23637d477c lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR
This new config is deemed to simplify even more the lockup detector
dependencies and can make it easier to bring a smooth sorting
between archs that support the new generic lockup detector and those
that still have their own, especially for those that are in the
middle of this migration.

Instead of checking whether we have CONFIG_LOCKUP_DETECTOR +
CONFIG_PERF_EVENTS_NMI each time an arch wants to know if it needs
to build its own lockup detector, take a shortcut with this new
config. It is enabled only if the hardlockup detection part of
the whole lockup detector is on.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
2010-05-16 01:57:42 +02:00
Hugh Dickins
16a2164bb0 profile: fix stats and data leakage
If the kernel is large or the profiling step small, /proc/profile
leaks data and readprofile shows silly stats, until readprofile -r
has reset the buffer: clear the prof_buffer when it is vmalloc()ed.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-14 19:45:06 -07:00
Li Zefan
e1f7992e01 tracing: Fix function declarations if !CONFIG_STACKTRACE
ftrace_trace_stack() and frace_trace_userstacke() take a
struct ring_buffer argument, not struct trace_array. Commit
e77405ad("tracing: pass around ring buffer instead of tracer")
made this change.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BE77C14.5010806@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-14 14:33:24 -04:00
Steven Rostedt
553552ce17 tracing: Combine event filter_active and enable into single flags field
The filter_active and enable both use an int (4 bytes each) to
set a single flag. We can save 4 bytes per event by combining the
two into a single integer.

   text	   data	    bss	    dec	    hex	filename
4913961	1088356	 861512	6863829	 68bbd5	vmlinux.orig
4894944	1018052	 861512	6774508	 675eec	vmlinux.id
4894871	1012292	 861512	6768675	 674823	vmlinux.flags

This gives us another 5K in savings.

The modification of both the enable and filter fields are done
under the event_mutex, so it is still safe to combine the two.

Note: Although Mathieu gave his Acked-by, he would like it documented
 that the reads of flags are not protected by the mutex. The way the
 code works, these reads will not break anything, but will have a
 residual effect. Since this behavior is the same even before this
 patch, describing this situation is left to another patch, as this
 patch does not change the behavior, but just brought it to Mathieu's
 attention.

v2: Updated the event trace self test to for this change.

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-14 14:33:22 -04:00
Steven Rostedt
32c0edaeaa tracing: Remove duplicate id information in event structure
Now that the trace_event structure is embedded in the ftrace_event_call
structure, there is no need for the ftrace_event_call id field.
The id field is the same as the trace_event type field.

Removing the id and re-arranging the structure brings down the tracepoint
footprint by another 5K.

   text	   data	    bss	    dec	    hex	filename
4913961	1088356	 861512	6863829	 68bbd5	vmlinux.orig
4895024	1023812	 861512	6780348	 6775bc	vmlinux.print
4894944	1018052	 861512	6774508	 675eec	vmlinux.id

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-14 14:33:15 -04:00
Steven Rostedt
80decc70af tracing: Move print functions into event class
Currently, every event has its own trace_event structure. This is
fine since the structure is needed anyway. But the print function
structure (trace_event_functions) is now separate. Since the output
of the trace event is done by the class (with the exception of events
defined by DEFINE_EVENT_PRINT), it makes sense to have the class
define the print functions that all events in the class can use.

This makes a bigger deal with the syscall events since all syscall events
use the same class. The savings here is another 30K.

   text	   data	    bss	    dec	    hex	filename
4913961	1088356	 861512	6863829	 68bbd5	vmlinux.orig
4900382	1048964	 861512	6810858	 67ecea	vmlinux.init
4900446	1049028	 861512	6810986	 67ed6a	vmlinux.preprint
4895024	1023812	 861512	6780348	 6775bc	vmlinux.print

To accomplish this, and to let the class know what event is being
printed, the event structure is embedded in the ftrace_event_call
structure. This should not be an issues since the event structure
was created for each event anyway.

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-14 14:20:34 -04:00
Steven Rostedt
a9a5776380 tracing: Allow events to share their print functions
Multiple events may use the same method to print their data.
Instead of having all events have a pointer to their print funtions,
the trace_event structure now points to a trace_event_functions structure
that will hold the way to print ouf the event.

The event itself is now passed to the print function to let the print
function know what kind of event it should print.

This opens the door to consolidating the way several events print
their output.

   text	   data	    bss	    dec	    hex	filename
4913961	1088356	 861512	6863829	 68bbd5	vmlinux.orig
4900382	1048964	 861512	6810858	 67ecea	vmlinux.init
4900446	1049028	 861512	6810986	 67ed6a	vmlinux.preprint

This change slightly increases the size but is needed for the next change.

v3: Fix the branch tracer events to handle this change.

v2: Fix the new function graph tracer event calls to handle this change.

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-14 14:20:32 -04:00
Steven Rostedt
0405ab80aa tracing: Move raw_init from events to class
The raw_init function pointer in the event is used to initialize
various kinds of events. The type of initialization needed is usually
classed to the kind of event it is.

Two events with the same class will always have the same initialization
function, so it makes sense to move this to the class structure.

Perhaps even making a special system structure would work since
the initialization is the same for all events within a system.
But since there's no system structure (yet), this will just move it
to the class.

   text	   data	    bss	    dec	    hex	filename
4913961	1088356	 861512	6863829	 68bbd5	vmlinux.orig
4900375	1053380	 861512	6815267	 67fe23	vmlinux.fields
4900382	1048964	 861512	6810858	 67ecea	vmlinux.init

The text grew very slightly, but this is a constant growth that happened
with the changing of the C files that call the init code.
The bigger savings is the data which will be saved the more events share
a class.

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-14 14:20:30 -04:00
Steven Rostedt
2e33af0295 tracing: Move fields from event to class structure
Move the defined fields from the event to the class structure.
Since the fields of the event are defined by the class they belong
to, it makes sense to have the class hold the information instead
of the individual events. The events of the same class would just
hold duplicate information.

After this change the size of the kernel dropped another 3K:

   text	   data	    bss	    dec	    hex	filename
4913961	1088356	 861512	6863829	 68bbd5	vmlinux.orig
4900252	1057412	 861512	6819176	 680d68	vmlinux.regs
4900375	1053380	 861512	6815267	 67fe23	vmlinux.fields

Although the text increased, this was mainly due to the C files
having to adapt to the change. This is a constant increase, where
new tracepoints will not increase the Text. But the big drop is
in the data size (as well as needed allocations to hold the fields).
This will give even more savings as more tracepoints are created.

Note, if just TRACE_EVENT()s are used and not DECLARE_EVENT_CLASS()
with several DEFINE_EVENT()s, then the savings will be lost. But
we are pushing developers to consolidate events with DEFINE_EVENT()
so this should not be an issue.

The kprobes define a unique class to every new event, but are dynamic
so it should not be a issue.

The syscalls however have a single class but the fields for the individual
events are different. The syscalls use a metadata to define the
fields. I moved the fields list from the event to the metadata and
added a "get_fields()" function to the class. This function is used
to find the fields. For normal events and kprobes, get_fields() just
returns a pointer to the fields list_head in the class. For syscall
events, it returns the fields list_head in the metadata for the event.

v2:  Fixed the syscall fields. The syscall metadata needs a list
     of fields for both enter and exit.

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-14 14:20:23 -04:00
Steven Rostedt
2239291aeb tracing: Remove per event trace registering
This patch removes the register functions of TRACE_EVENT() to enable
and disable tracepoints. The registering of a event is now down
directly in the trace_events.c file. The tracepoint_probe_register()
is now called directly.

The prototypes are no longer type checked, but this should not be
an issue since the tracepoints are created automatically by the
macros. If a prototype is incorrect in the TRACE_EVENT() macro, then
other macros will catch it.

The trace_event_class structure now holds the probes to be called
by the callbacks. This removes needing to have each event have
a separate pointer for the probe.

To handle kprobes and syscalls, since they register probes in a
different manner, a "reg" field is added to the ftrace_event_class
structure. If the "reg" field is assigned, then it will be called for
enabling and disabling of the probe for either ftrace or perf. To let
the reg function know what is happening, a new enum (trace_reg) is
created that has the type of control that is needed.

With this new rework, the 82 kernel events and 618 syscall events
has their footprint dramatically lowered:

   text	   data	    bss	    dec	    hex	filename
4913961	1088356	 861512	6863829	 68bbd5	vmlinux.orig
4914025	1088868	 861512	6864405	 68be15	vmlinux.class
4918492	1084612	 861512	6864616	 68bee8	vmlinux.tracepoint
4900252	1057412	 861512	6819176	 680d68	vmlinux.regs

The size went from 6863829 to 6819176, that's a total of 44K
in savings. With tracepoints being continuously added, this is
critical that the footprint becomes minimal.

v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf
    specific structure in trace_events.c.

v4: Fixed trace self tests to check probe because regfunc no longer
    exists.

v3: Updated to handle void *data in beginning of probe parameters.
    Also added the tracepoint: check_trace_callback_type_##call().

v2: Changed the callback probes to pass void * and typecast the
    value within the function.

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-14 14:19:14 -04:00
Steven Rostedt
38516ab59f tracing: Let tracepoints have data passed to tracepoint callbacks
This patch adds data to be passed to tracepoint callbacks.

The created functions from DECLARE_TRACE() now need a mandatory data
parameter. For example:

DECLARE_TRACE(mytracepoint, int value, value)

Will create the register function:

int register_trace_mytracepoint((void(*)(void *data, int value))probe,
                                void *data);

As the first argument, all callbacks (probes) must take a (void *data)
parameter. So a callback for the above tracepoint will look like:

void myprobe(void *data, int value)
{
}

The callback may choose to ignore the data parameter.

This change allows callbacks to register a private data pointer along
with the function probe.

	void mycallback(void *data, int value);

	register_trace_mytracepoint(mycallback, mydata);

Then the mycallback() will receive the "mydata" as the first parameter
before the args.

A more detailed example:

  DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status));

  /* In the C file */

  DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status));

  [...]

       trace_mytracepoint(status);

  /* In a file registering this tracepoint */

  int my_callback(void *data, int status)
  {
	struct my_struct my_data = data;
	[...]
  }

  [...]
	my_data = kmalloc(sizeof(*my_data), GFP_KERNEL);
	init_my_data(my_data);
	register_trace_mytracepoint(my_callback, my_data);

The same callback can also be registered to the same tracepoint as long
as the data registered is different. Note, the data must also be used
to unregister the callback:

	unregister_trace_mytracepoint(my_callback, my_data);

Because of the data parameter, tracepoints declared this way can not have
no args. That is:

  DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS());

will cause an error.

If no arguments are needed, a new macro can be used instead:

  DECLARE_TRACE_NOARGS(mytracepoint);

Since there are no arguments, the proto and args fields are left out.

This is part of a series to make the tracepoint footprint smaller:

   text	   data	    bss	    dec	    hex	filename
4913961	1088356	 861512	6863829	 68bbd5	vmlinux.orig
4914025	1088868	 861512	6864405	 68be15	vmlinux.class
4918492	1084612	 861512	6864616	 68bee8	vmlinux.tracepoint

Again, this patch also increases the size of the kernel, but
lays the ground work for decreasing it.

 v5: Fixed net/core/drop_monitor.c to handle these updates.

 v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the
     #ifdef CONFIG_TRACE_POINTS, since the two are the same in both
     cases. The __DECLARE_TRACE() is what changes.
     Thanks to Frederic Weisbecker for pointing this out.

 v3: Made all register_* functions require data to be passed and
     all callbacks to take a void * parameter as its first argument.
     This makes the calling functions comply with C standards.

     Also added more comments to the modifications of DECLARE_TRACE().

 v2: Made the DECLARE_TRACE() have the ability to pass arguments
     and added a new DECLARE_TRACE_NOARGS() for tracepoints that
     do not need any arguments.

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-14 09:50:34 -04:00
Steven Rostedt
8f08201830 tracing: Create class struct for events
This patch creates a ftrace_event_class struct that event structs point to.
This class struct will be made to hold information to modify the
events. Currently the class struct only holds the events system name.

This patch slightly increases the size, but this change lays the ground work
of other changes to make the footprint of tracepoints smaller.

With 82 standard tracepoints, and 618 system call tracepoints
(two tracepoints per syscall: enter and exit):

   text	   data	    bss	    dec	    hex	filename
4913961	1088356	 861512	6863829	 68bbd5	vmlinux.orig
4914025	1088868	 861512	6864405	 68be15	vmlinux.class

This patch also cleans up some stale comments in ftrace.h.

v2: Fixed missing semi-colon in macro.

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-14 09:33:49 -04:00
Steven Rostedt
23e117fa44 Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into trace/tip/tracing/core-4 2010-05-14 09:29:52 -04:00
Ingo Molnar
0167c78190 watchdog: Export touch_softlockup_watchdog
There are modules that rely on it:

  ERROR: "touch_softlockup_watchdog" [drivers/video/nvidia/nvidiafb.ko] undefined!

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
LKML-Reference: <1273713674-8434-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-13 08:53:33 +02:00
Ingo Molnar
ad56b0797e Merge commit 'v2.6.34-rc7' into tracing/core
Merge reason: Update from -rc5 to -rc7.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-13 08:11:29 +02:00
Don Zickus
d7c547335f lockup_detector: Separate touch_nmi_watchdog code path from touch_watchdog
When I combined the nmi_watchdog (hardlockup) and softlockup code, I
also combined the paths the touch_watchdog and touch_nmi_watchdog took.
This may not be the best idea as pointed out by Frederic W., that the
touch_watchdog case probably should not reset the hardlockup count.

Therefore the patch below falls back to the previous idea of keeping
the touch_nmi_watchdog a superset of the touch_watchdog case.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
LKML-Reference: <1273266711-18706-9-git-send-email-dzickus@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-12 23:55:55 +02:00
Don Zickus
f69bcf60c3 lockup_detector: Remove nmi_watchdog.c file
This file migrated to kernel/watchdog.c and then combined with
kernel/softlockup.c.  As a result kernel/nmi_watchdog.c is no longer
needed.  Just remove it.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
LKML-Reference: <1273266711-18706-5-git-send-email-dzickus@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-12 23:55:46 +02:00
Don Zickus
2508ce1845 lockup_detector: Remove old softlockup code
Now that is no longer compiled or used, just remove it.

Also move some of the code wrapped with DETECT_SOFTLOCKUP to the
LOCKUP_DETECTOR wrappers because that is the code that uses it now.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
LKML-Reference: <1273266711-18706-4-git-send-email-dzickus@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-12 23:55:45 +02:00
Don Zickus
332fbdbca3 lockup_detector: Touch_softlockup cleanups and softlockup_tick removal
Just some code cleanup to make touch_softlockup clearer and remove the
softlockup_tick function as it is no longer needed.

Also remove the /proc softlockup_thres call as it has been changed to
watchdog_thres.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
LKML-Reference: <1273266711-18706-3-git-send-email-dzickus@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-12 23:55:43 +02:00
Don Zickus
58687acba5 lockup_detector: Combine nmi_watchdog and softlockup detector
The new nmi_watchdog (which uses the perf event subsystem) is very
similar in structure to the softlockup detector.  Using Ingo's
suggestion, I combined the two functionalities into one file:
kernel/watchdog.c.

Now both the nmi_watchdog (or hardlockup detector) and softlockup
detector sit on top of the perf event subsystem, which is run every
60 seconds or so to see if there are any lockups.

To detect hardlockups, cpus not responding to interrupts, I
implemented an hrtimer that runs 5 times for every perf event
overflow event.  If that stops counting on a cpu, then the cpu is
most likely in trouble.

To detect softlockups, tasks not yielding to the scheduler, I used the
previous kthread idea that now gets kicked every time the hrtimer fires.
If the kthread isn't being scheduled neither is anyone else and the
warning is printed to the console.

I tested this on x86_64 and both the softlockup and hardlockup paths
work.

V2:
- cleaned up the Kconfig and softlockup combination
- surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI
- seperated out the softlockup case from perf event subsystem
- re-arranged the enabling/disabling nmi watchdog from proc space
- added cpumasks for hardlockup failure cases
- removed fallback to soft events if no PMU exists for hard events

V3:
- comment cleanups
- drop support for older softlockup code
- per_cpu cleanups
- completely remove software clock base hardlockup detector
- use per_cpu masking on hard/soft lockup detection
- #ifdef cleanups
- rename config option NMI_WATCHDOG to LOCKUP_DETECTOR
- documentation additions

V4:
- documentation fixes
- convert per_cpu to __get_cpu_var
- powerpc compile fixes

V5:
- split apart warn flags for hard and soft lockups

TODO:
- figure out how to make an arch-agnostic clock2cycles call
  (if possible) to feed into perf events as a sample period

[fweisbec: merged conflict patch]

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
LKML-Reference: <1273266711-18706-2-git-send-email-dzickus@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-12 23:55:33 +02:00
Frederic Weisbecker
a9aa1d02de Merge commit 'v2.6.34-rc7' into perf/nmi
Merge reason: catch up with latest softlockup detector changes.
2010-05-12 23:20:33 +02:00
Peter P Waskiewicz Jr
4308ad8011 genirq: Clear CPU mask in affinity_hint when none is provided
When an interrupt is disabled and torn down, the CPU mask returned
through affinity_hint right now is all CPUs. Also, for drivers that
don't provide an affinity_hint mask, this can be misleading. There
should be no hint at all, meaning an empty CPU mask.

[ tglx: use zalloc_cpumask_var instead of clearing it under the lock ]

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Cc: davem@davemloft.net
Cc: arjan@linux.jf.intel.com
Cc: bhutchings@solarflare.com
LKML-Reference: <20100505205638.5426.87189.stgit@ppwaskie-hc2.jf.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-12 11:23:34 +02:00
David S. Miller
278554bd65 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	Documentation/feature-removal-schedule.txt
	drivers/net/wireless/ath/ar9170/usb.c
	drivers/scsi/iscsi_tcp.c
	net/ipv4/ipmr.c
2010-05-12 00:05:35 -07:00
KAMEZAWA Hiroyuki
747388d78a memcg: fix css_is_ancestor() RCU locking
Some callers (in memcontrol.c) calls css_is_ancestor() without
rcu_read_lock.  Because css_is_ancestor() has to access RCU protected
data, it should be under rcu_read_lock().

This makes css_is_ancestor() itself does safe access to RCU protected
area.  (At least, "root" can have refcnt==0 if it's not an ancestor of
"child".  So, we need rcu_read_lock().)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11 17:33:42 -07:00
KAMEZAWA Hiroyuki
7f0f154641 memcg: fix css_id() RCU locking for real
Commit ad4ba37537 ("memcg: css_id() must be
called under rcu_read_lock()") modifies memcontol.c for fixing RCU check
message.  But Andrew Morton pointed out that the fix doesn't seems sane
and it was just for hidining lockdep messages.

This is a patch for do proper things.  Checking again, all places,
accessing without rcu_read_lock, that commit fixies was intentional....
all callers of css_id() has reference count on it.  So, it's not necessary
to be under rcu_read_lock().

Considering again, we can use rcu_dereference_check for css_id().  We know
css->id is valid if css->refcnt > 0.  (css->id never changes and freed
after css->refcnt going to be 0.)

This patch makes use of rcu_dereference_check() in css_id/depth and remove
unnecessary rcu-read-lock added by the commit.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11 17:33:42 -07:00
Vitaliy Gusev
11cad320a4 bsdacct: use del_timer_sync() in acct_exit_ns()
acct_exit_ns --> acct_file_reopen deletes timer without check timer
execution on other CPUs.  So acct_timeout() can change an unmapped memory.

Signed-off-by: Vitaliy Gusev <vgusev@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11 17:33:42 -07:00
Vitaly Mayatskikh
475f9aa6aa kexec: fix OOPS in crash_kernel_shrink
Two "echo 0 > /sys/kernel/kexec_crash_size" OOPSes kernel.  Also content
of this file is invalid after first shrink to zero: it shows 1 instead of
0.

This scenario is unlikely to happen often (root privs, valid crashkernel=
in cmdline, dump-capture kernel not loaded), I hit it only by chance.

This patch fixes it.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11 17:33:42 -07:00
Robin Holt
34441427aa revert "procfs: provide stack information for threads" and its fixup commits
Originally, commit d899bf7b ("procfs: provide stack information for
threads") attempted to introduce a new feature for showing where the
threadstack was located and how many pages are being utilized by the
stack.

Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
applied to fix the NO_MMU case.

Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
64-bit") was applied to fix a bug in ia32 executables being loaded.

Commit 9ebd4eba7 ("procfs: fix /proc/<pid>/stat stack pointer for kernel
threads") was applied to fix a bug which had kernel threads printing a
userland stack address.

Commit 1306d603f ('proc: partially revert "procfs: provide stack
information for threads"') was then applied to revert the stack pages
being used to solve a significant performance regression.

This patch nearly undoes the effect of all these patches.

The reason for reverting these is it provides an unusable value in
field 28.  For x86_64, a fork will result in the task->stack_start
value being updated to the current user top of stack and not the stack
start address.  This unpredictability of the stack_start value makes
it worthless.  That includes the intended use of showing how much stack
space a thread has.

Other architectures will get different values.  As an example, ia64
gets 0.  The do_fork() and copy_process() functions appear to treat the
stack_start and stack_size parameters as architecture specific.

I only partially reverted c44972f1 ("procfs: disable per-task stack usage
on NOMMU") .  If I had completely reverted it, I would have had to change
mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
configured.  Since I could not test the builds without significant effort,
I decided to not change mm/Makefile.

I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
information for threads on 64-bit") .  I left the KSTK_ESP() change in
place as that seemed worthwhile.

Signed-off-by: Robin Holt <holt@sgi.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11 17:33:41 -07:00
Paul E. McKenney
72d5a9f7a9 rcu: remove all rcu head initializations, except on_stack initializations
Remove all rcu head inits. We don't care about the RCU head state before passing
it to call_rcu() anyway. Only leave the "on_stack" variants so debugobjects can
keep track of objects on stack.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-11 16:10:47 -07:00
Alan Cox
8b6d043b7e resource: shared I/O region support
SuperIO devices share regions and use lock/unlock operations to chip
select.  We therefore need to be able to request a resource and wait for
it to be freed by whichever other SuperIO device currently hogs it.
Right now you have to poll which is horrible.

Add a MUXED field to IO port resources. If the MUXED field is set on the
resource and on the request (via request_muxed_region) then we block
until the previous owner of the muxed resource releases their region.

This allows us to implement proper resource sharing and locking for
superio chips using code of the form

enable_my_superio_dev() {
	request_muxed_region(0x44, 0x02, "superio:watchdog");
	outb() ..sequence to enable chip
}

disable_my_superio_dev() {
	outb() .. sequence of disable chip
	release_region(0x44, 0x02);
}

Signed-off-by: Giel van Schijndel <me@mortis.eu>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2010-05-11 12:01:10 -07:00
Changli Gao
a93d2f1744 sched, wait: Use wrapper functions
epoll should not touch flags in wait_queue_t. This patch introduces a new
function __add_wait_queue_exclusive(), for the users, who use wait queue as a
LIFO queue.

__add_wait_queue_tail_exclusive() is introduced too instead of
add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed, as
it is a duplicate of __remove_wait_queue(), disliked by users, and with less
users.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: <containers@lists.linux-foundation.org>
LKML-Reference: <1273214006-2979-1-git-send-email-xiaosuo@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-11 17:43:58 +02:00
Peter Zijlstra
96c21a460a perf: Fix exit() vs event-groups
Corey reported that the value scale times of group siblings are not
updated when the monitored task dies.

The problem appears to be that we only update the group leader's
time values, fix it by updating the whole group.

Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@kernel.org> # .34.x
LKML-Reference: <1273588935.1810.6.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-11 17:08:24 +02:00
Peter Zijlstra
050735b08c perf: Fix exit() vs PERF_FORMAT_GROUP
Both Stephane and Corey reported that PERF_FORMAT_GROUP didn't
work as expected if the task the counters were attached to quit
before the read() call.

The cause is that we unconditionally destroy the grouping when
we remove counters from their context. Fix this by splitting off
the group destroy from the list removal such that
perf_event_remove_from_context() does not do this and change
perf_event_release() to do so.

Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: <stable@kernel.org> # .34.x
LKML-Reference: <1273571513.5605.3527.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-11 15:46:43 +02:00
Ingo Molnar
e3174cfd2a Revert "perf: Fix exit() vs PERF_FORMAT_GROUP"
This reverts commit 4fd38e4595.

It causes various crashes and hangs when events are activated.

The cause is not fully understood yet but we need to revert it
because the effects are severe.

Reported-by: Stephane Eranian <eranian@google.com>
Reported-by: Lin Ming <ming.m.lin@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-11 08:31:49 +02:00
Matt Helsley
8f77578cc2 Freezer / cgroup freezer: Update stale locking comments
Update stale comments regarding locking order and add a little more detail
so it's easier to follow the locking between the cgroup freezer and the
power management freezer code.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-05-10 23:18:47 +02:00
Mark Gross
ed77134bfc PM QOS update
This patch changes the string based list management to a handle base
implementation to help with the hot path use of pm-qos, it also renames
much of the API to use "request" as opposed to "requirement" that was
used in the initial implementation.  I did this because request more
accurately represents what it actually does.

Also, I added a string based ABI for users wanting to use a string
interface.  So if the user writes 0xDDDDDDDD formatted hex it will be
accepted by the interface.  (someone asked me for it and I don't think
it hurts anything.)

This patch updates some documentation input I got from Randy.

Signed-off-by: markgross <mgross@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-05-10 23:08:19 +02:00
Randy Dunlap
0fef8b1e83 PM / Hibernate: Fix block_io.c printk warning
Fix printk format warning in block_io.c:

kernel/power/block_io.c:41: warning: format '%ld' expects type 'long int', but argument 2 has type 'sector_t'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-05-10 23:08:18 +02:00
Jiri Slaby
6f612af578 PM / Hibernate: Group swap ops
Move all the swap processing into one function. It will make swap
calls from a non-swap code easier.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-05-10 23:08:18 +02:00
Jiri Slaby
51fb352b2c PM / Hibernate: Move the first_sector out of swsusp_write
The first sector knowledge is swap-only specific. Move it into the
swap handle. This will be needed for later non-swap specific code
moving into snapshot.c.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2010-05-10 23:08:18 +02:00
Jiri Slaby
8a0d613fa1 PM / Hibernate: Separate block_io
Move block I/O operations to a separate file. It is because it will
be used later not only by the swap writer.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-05-10 23:08:18 +02:00
Jiri Slaby
d3c1b24c50 PM / Hibernate: Snapshot cleanup
Remove support of reads with offset. This means snapshot_read/write_next
now does not accept count parameter. It allows to clean up the functions
and snapshot handle which no longer needs to care about offsets.

/dev/snapshot handler is converted to simple_{read_from,write_to}_buffer
which take care of offsets.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-05-10 23:08:17 +02:00
Paul E. McKenney
d822ed1094 rcu: fix build bug in RCU_FAST_NO_HZ builds
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:35 -07:00
Paul E. McKenney
77e38ed347 rcu: RCU_FAST_NO_HZ must check RCU dyntick state
The current version of RCU_FAST_NO_HZ reproduces the old CLASSIC_RCU
dyntick-idle bug, as it fails to detect CPUs that have interrupted
or NMIed out of dyntick-idle mode.  Fix this by making rcu_needs_cpu()
check the state in the per-CPU rcu_dynticks variables, thus correctly
detecting the dyntick-idle state from an RCU perspective.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:35 -07:00
Paul E. McKenney
d21670acab rcu: reduce the number of spurious RCU_SOFTIRQ invocations
Lai Jiangshan noted that up to 10% of the RCU_SOFTIRQ are spurious, and
traced this down to the fact that the current grace-period machinery
will uselessly raise RCU_SOFTIRQ when a given CPU needs to go through
a quiescent state, but has not yet done so.  In this situation, there
might well be nothing that RCU_SOFTIRQ can do, and the overhead can be
worth worrying about in the ksoftirqd case.  This patch therefore avoids
raising RCU_SOFTIRQ in this situation.

Changes since v1 (http://lkml.org/lkml/2010/3/30/122 from Lai Jiangshan):

o	Omit the rcu_qs_pending() prechecks, as they aren't that
	much less expensive than the quiescent-state checks.

o	Merge with the set_need_resched() patch that reduces IPIs.

o	Add the new n_rp_report_qs field to the rcu_pending tracing output.

o	Update the tracing documentation accordingly.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:35 -07:00
Paul E. McKenney
4a90a0681c rcu: permit discontiguous cpu_possible_mask CPU numbering
TREE_RCU assumes that CPU numbering is contiguous, but some users need
large holes in the numbering to better map to hardware layout.  This patch
makes TREE_RCU (and TREE_PREEMPT_RCU) tolerate large holes in the CPU
numbering.  However, NR_CPUS must still be greater than the largest
CPU number.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:35 -07:00
Paul E. McKenney
4300aa642c rcu: improve RCU CPU stall-warning messages
The existing RCU CPU stall-warning messages can be confusing, especially
in the case where one CPU detects a single other stalled CPU.  In addition,
the console messages did not say which flavor of RCU detected the stall,
which can make it difficult to work out exactly what is causing the stall.
This commit improves these messages.

Requested-by: Dhaval Giani <dhaval.giani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:34 -07:00
Paul E. McKenney
26845c2860 rcu: print boot-time console messages if RCU configs out of ordinary
Print boot-time messages if tracing is enabled, if fanout is set
to non-default values, if exact fanout is specified, if accelerated
dyntick-idle grace periods have been enabled, if RCU-lockdep is enabled,
if rcutorture has been boot-time enabled, if the CPU stall detector has
been disabled, or if four-level hierarchy has been enabled.

This is all for TREE_RCU and TREE_PREEMPT_RCU.  TINY_RCU will be handled
separately, if at all.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:34 -07:00
Paul E. McKenney
c68de2097a rcu: disable CPU stall warnings upon panic
The current RCU CPU stall warnings remain enabled even after a panic
occurs, which some people have found to be a bit counterproductive.
This patch therefore uses a notifier to disable stall warnings once a
panic occurs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:34 -07:00
Paul E. McKenney
bbad937983 rcu: slim down rcutiny by removing rcu_scheduler_active and friends
TINY_RCU does not need rcu_scheduler_active unless CONFIG_DEBUG_LOCK_ALLOC.
So conditionally compile rcu_scheduler_active in order to slim down
rcutiny a bit more.  Also gets rid of an EXPORT_SYMBOL_GPL, which is
responsible for most of the slimming.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:34 -07:00
Paul E. McKenney
25502a6c13 rcu: refactor RCU's context-switch handling
The addition of preemptible RCU to treercu resulted in a bit of
confusion and inefficiency surrounding the handling of context switches
for RCU-sched and for RCU-preempt.  For RCU-sched, a context switch
is a quiescent state, pure and simple, just like it always has been.
For RCU-preempt, a context switch is in no way a quiescent state, but
special handling is required when a task blocks in an RCU read-side
critical section.

However, the callout from the scheduler and the outer loop in ksoftirqd
still calls something named rcu_sched_qs(), whose name is no longer
accurate.  Furthermore, when rcu_check_callbacks() notes an RCU-sched
quiescent state, it ends up unnecessarily (though harmlessly, aside
from the performance hit) enqueuing the current task if it happens to
be running in an RCU-preempt read-side critical section.  This not only
increases the maximum latency of scheduler_tick(), it also needlessly
increases the overhead of the next outermost rcu_read_unlock() invocation.

This patch addresses this situation by separating the notion of RCU's
context-switch handling from that of RCU-sched's quiescent states.
The context-switch handling is covered by rcu_note_context_switch() in
general and by rcu_preempt_note_context_switch() for preemptible RCU.
This permits rcu_sched_qs() to handle quiescent states and only quiescent
states.  It also reduces the maximum latency of scheduler_tick(), though
probably by much less than a microsecond.  Finally, it means that tasks
within preemptible-RCU read-side critical sections avoid incurring the
overhead of queuing unless there really is a context switch.

Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
2010-05-10 11:08:33 -07:00
Paul E. McKenney
99652b54de rcu: rename rcutiny rcu_ctrlblk to rcu_sched_ctrlblk
Make naming line up in preparation for CONFIG_TINY_PREEMPT_RCU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:32 -07:00
Paul E. McKenney
da848c47bc rcu: shrink rcutiny by making synchronize_rcu_bh() be inline
Because synchronize_rcu_bh() is identical to synchronize_sched(),
make the former a static inline invoking the latter, saving the
overhead of an EXPORT_SYMBOL_GPL() and the duplicate code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:32 -07:00
Lai Jiangshan
5db356736a rcu: ignore offline CPUs in last non-dyntick-idle CPU check
Offline CPUs are not in nohz_cpu_mask, but can be ignored when checking
for the last non-dyntick-idle CPU.  This patch therefore only checks
online CPUs for not being dyntick idle, allowing fast entry into
full-system dyntick-idle state even when there are some offline CPUs.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:31 -07:00
Lai Jiangshan
0c34029abd rcu: move some code from macro to function
Shrink the RCU_INIT_FLAVOR() macro by moving all but the initialization
of the ->rda[] array to rcu_init_one().  The call to rcu_init_one()
can then be moved to the end of the RCU_INIT_FLAVOR() macro, which is
required because rcu_boot_init_percpu_data(), which is now called from
rcu_init_one(), depends on the initialization of the ->rda[] array.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:31 -07:00
Lai Jiangshan
f261414f0d rcu: make dead code really dead
cleanup: make dead code really dead

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:31 -07:00
Paul E. McKenney
d25eb9442b rcu: substitute set_need_resched for sending resched IPIs
This patch adds a check to __rcu_pending() that does a local
set_need_resched() if the current CPU is holding up the current grace
period and if force_quiescent_state() will be called soon.  The goal is
to reduce the probability that force_quiescent_state() will need to do
smp_send_reschedule(), which sends an IPI and is therefore more expensive
on most architectures.

Signed-off-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:31 -07:00
Lai Jiangshan
2b3fc35f69 rcu: optionally leave lockdep enabled after RCU lockdep splat
There is no need to disable lockdep after an RCU lockdep splat,
so remove the debug_lockdeps_off() from lockdep_rcu_dereference().
To avoid repeated lockdep splats, use a static variable in the inlined
rcu_dereference_check() and rcu_dereference_protected() macros so that
a given instance splats only once, but so that multiple instances can
be detected per boot.

This is controlled by a new config variable CONFIG_PROVE_RCU_REPEATEDLY,
which is disabled by default.  This provides the normal lockdep behavior
by default, but permits people who want to find multiple RCU-lockdep
splats per boot to easily do so.

Requested-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-10 11:08:31 -07:00
John Stultz
d7e81c269d clocksource: Add clocksource_register_hz/khz interface
How to pick good mult/shift pairs has always been difficult to
describe to folks writing clocksource drivers, since it requires
careful tradeoffs in adjustment accuracy vs overflow limits.

Now, with the clocks_calc_mult_shift function, its much
easier. However, not many clocksources have converted to using that
function, and there is still the issue of the max interval length
assumption being made by each clocksource driver independently.

So this patch simplifies the registration process by having
clocksources be registered with a hz/khz value and the registration
function taking care of setting mult/shift.

This should take most of the confusion out of writing a clocksource
driver.

Additionally it also keeps the shift size tradeoff (more accuracy vs
longer possible nohz times) centralized so the timekeeping core can
keep track of the assumptions being made.

[ tglx: Coding style and comments fixed ]

Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1273280858-30143-1-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-10 14:24:26 +02:00
Stanislaw Gruszka
29f87b793d posix-cpu-timers: Optimize run_posix_cpu_timers()
We can optimize and simplify things taking into account signal->cputimer
is always running when we have configured any process wide cpu timer.

In check_process_timers(), we don't have to check if new updated value of
signal->cputime_expires is smaller, since we maintain new first expiration
time ({prof,virt,sched}_expires) in code flow and all other writes to
expiration cache are protected by sighand->siglock .

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-10 14:24:26 +02:00
Thomas Gleixner
dbb6be6d5e Merge branch 'linus' into timers/core
Reason: Further posix_cpu_timer patches depend on mainline changes

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-10 14:20:42 +02:00
Ingo Molnar
7c224a03a7 Merge commit 'v2.6.34-rc7' into oprofile
Merge reason: Update to Linus's latest -rc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-10 13:12:29 +02:00
Li Zefan
af507ae8a0 sched: Remove a stale comment
This comment should have been removed together with uids_mutex
when removing user sched.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dhaval Giani <dhaval.giani@gmail.com>
LKML-Reference: <4BE77C6B.5010402@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-10 08:48:39 +02:00
Arjan van de Ven
0224cf4c5e sched: Intoduce get_cpu_iowait_time_us()
For the ondemand cpufreq governor, it is desired that the iowait
time is microaccounted in a similar way as idle time is.

This patch introduces the infrastructure to account and expose
this information via the get_cpu_iowait_time_us() function.

[akpm@linux-foundation.org: fix CONFIG_NO_HZ=n build]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082523.284feab6@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-09 19:35:27 +02:00
Arjan van de Ven
e0e37c200f sched: Eliminate the ts->idle_lastupdate field
Now that the only user of ts->idle_lastupdate is
update_ts_time_stats(), the entire field can be eliminated.

In update_ts_time_stats(), idle_lastupdate is first set to
"now", and a few lines later, the only user is an if() statement
that assigns a variable either to "now" or to
ts->idle_lastupdate, which has the value of "now" at that point.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082439.2fab0b4f@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-09 19:35:26 +02:00
Arjan van de Ven
8d63bf949e sched: Fold updating of the last_update_time_info into update_ts_time_stats()
This patch folds the updating of the last_update_time into the
update_ts_time_stats() function, and updates the callers.

This allows for further cleanups that are done in the next
patch.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082403.60072967@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-09 19:35:26 +02:00
Arjan van de Ven
8c7b09f43f sched: Update the idle statistics in get_cpu_idle_time_us()
Right now, get_cpu_idle_time_us() only reports the idle
statistics upto the point the CPU entered last idle; not what is
valid right now.

This patch adds an update of the idle statistics to
get_cpu_idle_time_us(), so that calling this function always
returns statistics that are accurate at the point of the call.

This includes resetting the start of the idle time for
accounting purposes to avoid double accounting.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082323.2d2f1945@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-09 19:35:26 +02:00
Arjan van de Ven
595aac488b sched: Introduce a function to update the idle statistics
Currently, two places update the idle statistics (and more to
come later in this series).

This patch creates a helper function for updating these
statistics.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082245.163e67ed@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-09 19:35:25 +02:00
Arjan van de Ven
b1f724c305 sched: Add a comment to get_cpu_idle_time_us()
The exported function get_cpu_idle_time_us() has no comment
describing it; add a kerneldoc comment

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082208.7cb721f0@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-09 19:35:25 +02:00
Frederic Weisbecker
9313543945 tracing: Drop the nested field from lock_release event
Drop the nested field as we don't use it. Every nested state can
be computed from a state machine on post processing already.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Steven Rostedt <rostedt@goodmis.org>
2010-05-09 13:45:34 +02:00
Frederic Weisbecker
883a2a3189 tracing: Drop lock_acquired waittime field
Drop the waittime field from the lock_acquired event, we can
calculate it by substracting the lock_acquired event timestamp
with the matching lock_acquire one.

It is not needed and takes useless space in the traces.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Steven Rostedt <rostedt@goodmis.org>
2010-05-09 13:45:32 +02:00
Ingo Molnar
e7858f52a5 Merge branch 'cpu_stop' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into sched/core 2010-05-08 18:11:19 +02:00
Masami Hiramatsu
c0614829c1 kprobes: Move enable/disable_kprobe() out from debugfs code
Move enable/disable_kprobe() API out from debugfs related code,
because these interfaces are not related to debugfs interface.

This fixes a compiler warning.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by:  Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
LKML-Reference: <20100427223312.2322.60512.stgit@localhost6.localdomain6>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-08 18:08:30 +02:00
Tejun Heo
bbf1bb3eee cpu_stop: add dummy implementation for UP
When !CONFIG_SMP, cpu_stop functions weren't defined at all which
could lead to build failures if UP code uses cpu_stop facility.  Add
dummy cpu_stop implementation for UP.  The waiting variants execute
the work function directly with preempt disabled and
stop_one_cpu_nowait() schedules a workqueue work.

Makefile and ifdefs around stop_machine implementation are updated to
accomodate CONFIG_SMP && !CONFIG_STOP_MACHINE case.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
2010-05-08 17:12:33 +02:00
Paul Mackerras
6e85158cf5 perf_event: Make software events work again
Commit 6bde9b6ce0 ("perf: Add
group scheduling transactional APIs") added code to allow a
group to be scheduled in a single transaction.  However, it
introduced a bug in handling events whose pmu does not implement
transactions -- at the end of scheduling in the events in the
group, in the non-transactional case the code now falls through
to the group_error label, and proceeds to unschedule all the
events in the group and return failure.

This fixes it by returning 0 (success) in the non-transactional
case.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: eranian@gmail.com
LKML-Reference: <20100508105800.GB10650@brick.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-08 13:16:55 +02:00
Neil Horman
3ee943728f ipv4: remove ip_rt_secret timer (v4)
A while back there was a discussion regarding the rt_secret_interval timer.
Given that we've had the ability to do emergency route cache rebuilds for awhile
now, based on a statistical analysis of the various hash chain lengths in the
cache, the use of the flush timer is somewhat redundant.  This patch removes the
rt_secret_interval sysctl, allowing us to rely solely on the statistical
analysis mechanism to determine the need for route cache flushes.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-08 01:57:52 -07:00
Linus Torvalds
91bc482ec5 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  rcu: create rcu_my_thread_group_empty() wrapper
  memcg: css_id() must be called under rcu_read_lock()
  cgroup: Check task_lock in task_subsys_state()
  sched: Fix an RCU warning in print_task()
  cgroup: Fix an RCU warning in alloc_css_id()
  cgroup: Fix an RCU warning in cgroup_path()
  KEYS: Fix an RCU warning in the reading of user keys
  KEYS: Fix an RCU warning
2010-05-07 13:58:21 -07:00
Lin Ming
6bde9b6ce0 perf: Add group scheduling transactional APIs
Add group scheduling transactional APIs to struct pmu.
These APIs will be implemented in arch code, based on Peter's idea as
below.

> the idea behind hw_perf_group_sched_in() is to not perform
> schedulability tests on each event in the group, but to add the group
> as a whole and then perform one test.
>
> Of course, when that test fails, you'll have to roll-back the whole
> group again.
>
> So start_txn (or a better name) would simply toggle a flag in the pmu
> implementation that will make pmu::enable() not perform the
> schedulablilty test.
>
> Then commit_txn() will perform the schedulability test (so note the
> method has to have a !void return value.
>
> This will allow us to use the regular
> kernel/perf_event.c::group_sched_in() and all the rollback code.
> Currently each hw_perf_group_sched_in() implementation duplicates all
> the rolllback code (with various bugs).

->start_txn:
Start group events scheduling transaction, set a flag to make
pmu::enable() not perform the schedulability test, it will be performed
at commit time.

->commit_txn:
Commit group events scheduling transaction, perform the group
schedulability as a whole

->cancel_txn:
Stop group events scheduling transaction, clear the flag so
pmu::enable() will perform the schedulability test.

Reviewed-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1272002160.5707.60.camel@minggr.sh.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-07 11:31:02 +02:00
Peter Zijlstra
a0507c84bf perf: Annotate perf_event_read_group() vs perf_event_release_kernel()
Stephane reported a lockdep warning while using PERF_FORMAT_GROUP.

The issue is that perf_event_read_group() takes faults while holding
the ctx->mutex, while perf_event_release_kernel() can be called from
munmap(). Which makes for an AB-BA deadlock.

Except we can never establish the deadlock because we'll only ever
call perf_event_release_kernel() after all file descriptors are dead
so there is no concurrency possible.

Reported-by: Stephane Eranian <eranian@google.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-07 11:30:59 +02:00
Ingo Molnar
cce9131781 Merge branch 'perf/urgent' into perf/core
Merge reason: Resolve patch dependency

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-07 11:30:30 +02:00
Peter Zijlstra
4fd38e4595 perf: Fix exit() vs PERF_FORMAT_GROUP
Both Stephane and Corey reported that PERF_FORMAT_GROUP didn't work
as expected if the task the counters were attached to quit before
the read() call.

The cause is that we unconditionally destroy the grouping when we
remove counters from their context. Fix this by only doing this when
we free the counter itself.

Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1273160566.5605.404.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-07 11:30:17 +02:00
Peter Zijlstra
27a9da6538 sched: Remove rq argument to the tracepoints
struct rq isn't visible outside of sched.o so its near useless to
expose the pointer, also there are no users of it, so remove it.

Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1272997616.1642.207.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-07 11:28:17 +02:00
Ingo Molnar
48652ced15 Merge commit 'v2.6.34-rc6' into sched/core 2010-05-07 11:27:54 +02:00
Yong Zhang
4726f2a617 lockdep: Reduce stack_trace usage
When calling check_prevs_add(), if all validations passed
add_lock_to_list() will add new lock to dependency tree and
alloc stack_trace for each list_entry.

But at this time, we are always on the same stack, so stack_trace
for each list_entry has the same value. This is redundant and eats
up lots of memory which could lead to warning on low
MAX_STACK_TRACE_ENTRIES.

Use one copy of stack_trace instead.

V2: As suggested by Peter Zijlstra, move save_trace() from
    check_prevs_add() to check_prev_add().
    Add tracking for trylock dependence which is also redundant.

Signed-off-by: Yong Zhang <yong.zhang0@windriver.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100504065711.GC10784@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-07 11:27:26 +02:00
Paul E. McKenney
fc390cde36 rcu: need barrier() in UP synchronize_sched_expedited()
If synchronize_sched_expedited() is ever to be called from within
kernel/sched.c in a !SMP PREEMPT kernel, the !SMP implementation needs
a barrier().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-05-07 07:23:21 +02:00
Dan Carpenter
d9f599e1e6 perf: Fix check at end of event search
The original code doesn't work because "call" is never NULL there.

Signed-off-by: Dan Carpenter <error27@gmail.com>
LKML-Reference: <20100320143911.GF5331@bicker>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-06 19:49:52 -04:00
Paul E. McKenney
cc631fb732 sched: correctly place paranioa memory barriers in synchronize_sched_expedited()
The memory barriers must be in the SMP case, not in the !SMP case.
Also add a barrier after the atomic_inc() in order to ensure that
other CPUs see post-synchronize_sched_expedited() actions as following
the expedited grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-05-06 18:49:21 +02:00
Tejun Heo
94458d5ecb sched: kill paranoia check in synchronize_sched_expedited()
The paranoid check which verifies that the cpu_stop callback is
actually called on all online cpus is completely superflous.  It's
guaranteed by cpu_stop facility and if it didn't work as advertised
other things would go horribly wrong and trying to recover using
synchronize_sched() wouldn't be very meaningful.

Kill the paranoid check.  Removal of this feature is done as a
separate step so that it can serve as a bisection point if something
actually goes wrong.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Josh Triplett <josh@freedesktop.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
2010-05-06 18:49:21 +02:00
Tejun Heo
969c79215a sched: replace migration_thread with cpu_stop
Currently migration_thread is serving three purposes - migration
pusher, context to execute active_load_balance() and forced context
switcher for expedited RCU synchronize_sched.  All three roles are
hardcoded into migration_thread() and determining which job is
scheduled is slightly messy.

This patch kills migration_thread and replaces all three uses with
cpu_stop.  The three different roles of migration_thread() are
splitted into three separate cpu_stop callbacks -
migration_cpu_stop(), active_load_balance_cpu_stop() and
synchronize_sched_expedited_cpu_stop() - and each use case now simply
asks cpu_stop to execute the callback as necessary.

synchronize_sched_expedited() was implemented with private
preallocated resources and custom multi-cpu queueing and waiting
logic, both of which are provided by cpu_stop.
synchronize_sched_expedited_count is made atomic and all other shared
resources along with the mutex are dropped.

synchronize_sched_expedited() also implemented a check to detect cases
where not all the callback got executed on their assigned cpus and
fall back to synchronize_sched().  If called with cpu hotplug blocked,
cpu_stop already guarantees that and the condition cannot happen;
otherwise, stop_machine() would break.  However, this patch preserves
the paranoid check using a cpumask to record on which cpus the stopper
ran so that it can serve as a bisection point if something actually
goes wrong theree.

Because the internal execution state is no longer visible,
rcu_expedited_torture_stats() is removed.

This patch also renames cpu_stop threads to from "stopper/%d" to
"migration/%d".  The names of these threads ultimately don't matter
and there's no reason to make unnecessary userland visible changes.

With this patch applied, stop_machine() and sched now share the same
resources.  stop_machine() is faster without wasting any resources and
sched migration users are much cleaner.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Josh Triplett <josh@freedesktop.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
2010-05-06 18:49:21 +02:00
Tejun Heo
3fc1f1e27a stop_machine: reimplement using cpu_stop
Reimplement stop_machine using cpu_stop.  As cpu stoppers are
guaranteed to be available for all online cpus,
stop_machine_create/destroy() are no longer necessary and removed.

With resource management and synchronization handled by cpu_stop, the
new implementation is much simpler.  Asking the cpu_stop to execute
the stop_cpu() state machine on all online cpus with cpu hotplug
disabled is enough.

stop_machine itself doesn't need to manage any global resources
anymore, so all per-instance information is rolled into struct
stop_machine_data and the mutex and all static data variables are
removed.

The previous implementation created and destroyed RT workqueues as
necessary which made stop_machine() calls highly expensive on very
large machines.  According to Dimitri Sivanich, preventing the dynamic
creation/destruction makes booting faster more than twice on very
large machines.  cpu_stop resources are preallocated for all online
cpus and should have the same effect.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
2010-05-06 18:49:20 +02:00
Tejun Heo
1142d81029 cpu_stop: implement stop_cpu[s]()
Implement a simplistic per-cpu maximum priority cpu monopolization
mechanism.  A non-sleeping callback can be scheduled to run on one or
multiple cpus with maximum priority monopolozing those cpus.  This is
primarily to replace and unify RT workqueue usage in stop_machine and
scheduler migration_thread which currently is serving multiple
purposes.

Four functions are provided - stop_one_cpu(), stop_one_cpu_nowait(),
stop_cpus() and try_stop_cpus().

This is to allow clean sharing of resources among stop_cpu and all the
migration thread users.  One stopper thread per cpu is created which
is currently named "stopper/CPU".  This will eventually replace the
migration thread and take on its name.

* This facility was originally named cpuhog and lived in separate
  files but Peter Zijlstra nacked the name and thus got renamed to
  cpu_stop and moved into stop_machine.c.

* Better reporting of preemption leak as per Peter's suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
2010-05-06 18:49:20 +02:00
Paul E. McKenney
ee84b8243b rcu: create rcu_my_thread_group_empty() wrapper
Some RCU-lockdep splat repairs need to know whether they are running
in a single-threaded process.  Unfortunately, the thread_group_empty()
primitive is defined in sched.h, and can induce #include hell.  This
commit therefore introduces a rcu_my_thread_group_empty() wrapper that
is defined in rcupdate.c, thus avoiding the need to include sched.h
everywhere.

Signed-off-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
2010-05-06 09:28:41 -07:00
James Morris
043b4d40f5 Merge branch 'master' into next
Conflicts:
	security/keys/keyring.c

Resolved conflict with whitespace fix in find_keyring_by_name()

Signed-off-by: James Morris <jmorris@namei.org>
2010-05-06 22:21:04 +10:00
James Morris
0ffbe2699c Merge branch 'master' into next 2010-05-06 10:56:07 +10:00
Thiago Farina
668eb65f09 tracing: Fix "integer as NULL pointer" warning.
kernel/trace/trace_output.c:256:24: warning: Using plain integer as NULL pointer

Signed-off-by: Thiago Farina <tfransosi@gmail.com>
LKML-Reference: <1264349038-1766-3-git-send-email-tfransosi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-05 12:01:26 -04:00
Linus Torvalds
8777c793d6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: flush_delayed_work: keep the original workqueue for re-queueing
2010-05-05 07:56:36 -07:00
Linus Torvalds
f5fa05d972 Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf: Fix resource leak in failure path of perf_event_open()
2010-05-04 15:16:15 -07:00
Linus Torvalds
f2809d61d6 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  rcu: Fix RCU lockdep splat on freezer_fork path
  rcu: Fix RCU lockdep splat in set_task_cpu on fork path
  mutex: Don't spin when the owner CPU is offline or other weird cases
2010-05-04 15:15:43 -07:00
Li Zefan
b629317e66 sched: Fix an RCU warning in print_task()
With CONFIG_PROVE_RCU=y, a warning can be triggered:

  $ cat /proc/sched_debug

...
kernel/cgroup.c:1649 invoked rcu_dereference_check() without protection!
...

Both cgroup_path() and task_group() should be called with either
rcu_read_lock or cgroup_mutex held.

The rcu_dereference_check() does include cgroup_lock_is_held(), so we
know that this lock is not held.  Therefore, in a CONFIG_PREEMPT kernel,
to say nothing of a CONFIG_PREEMPT_RT kernel, the original code could
have ended up copying a string out of the freelist.

This patch inserts RCU read-side primitives needed to prevent this
scenario.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-04 09:25:01 -07:00
Li Zefan
fae9c79170 cgroup: Fix an RCU warning in alloc_css_id()
With CONFIG_PROVE_RCU=y, a warning can be triggered:

  # mount -t cgroup -o memory xxx /mnt
  # mkdir /mnt/0

...
kernel/cgroup.c:4442 invoked rcu_dereference_check() without protection!
...

This is a false-positive. It's safe to directly access parent_css->id.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-04 09:25:00 -07:00
Li Zefan
9a9686b634 cgroup: Fix an RCU warning in cgroup_path()
with CONFIG_PROVE_RCU=y, a warning can be triggered:

  # mount -t cgroup -o debug xxx /mnt
  # cat /proc/$$/cgroup

...
kernel/cgroup.c:1649 invoked rcu_dereference_check() without protection!
...

This is a false-positive, because cgroup_path() can be called
with either rcu_read_lock() held or cgroup_mutex held.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-05-04 09:24:59 -07:00
Borislav Petkov
956097912c ring-buffer: Wrap open-coded WARN_ONCE
Wrap open-coded WARN_ONCE functionality into the equivalent macro.

Signed-off-by: Borislav Petkov <bp@alien8.de>
LKML-Reference: <20100502060354.GA5281@liondog.tnic>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-05-04 12:23:47 -04:00
Frederic Weisbecker
777d0411cd hw_breakpoints: Fix percpu build failure
Fix this build error:

   kernel/hw_breakpoint.c:58:1: error: pasting "__pcpu_scope_" and "*" does not give a valid preprocessing token

It happens if CONFIG_DEBUG_FORCE_WEAK_PER_CPU, because we concatenate
someting with the name and we have the "*" in the name.

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: K. Prasad <prasad@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
LKML-Reference: <20100503133942.GA5497@nowhere>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-04 08:39:36 +02:00
Frederic Weisbecker
54d47a2be5 lockdep: No need to disable preemption in debug atomic ops
No need to disable preemption in the debug_atomic_* ops, as
we ensure interrupts are disabled already.

So let's use the __this_cpu_ops() rather than this_cpu_ops() that
enclose the ops in a preempt disabled section.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2010-05-04 05:38:16 +02:00
Frederic Weisbecker
fa9a97dec6 lockdep: Actually _dec_ in debug_atomic_dec
Fix a silly copy-paste mistake that was making debug_atomic_dec use
this_cpu_inc instead of this_cpu_dec.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2010-05-04 05:38:12 +02:00
Frederic Weisbecker
ba697f40db lockdep: Provide off case for redundant_hardirqs_on increment
We forgot to provide a !CONFIG_DEBUG_LOCKDEP case for the
redundant_hardirqs_on stat handling.

Manage that in the headers with a new __debug_atomic_inc() helper.

Fixes:

	kernel/lockdep.c:2306: error: 'lockdep_stats' undeclared (first use in this function)
	kernel/lockdep.c:2306: error: (Each undeclared identifier is reported only once
	kernel/lockdep.c:2306: error: for each function it appears in.)

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2010-05-04 05:37:28 +02:00
Peter P Waskiewicz Jr
e7a297b0d7 genirq: Add CPU mask affinity hint
This patch adds a cpumask affinity hint to the irq_desc structure,
along with a registration function and a read-only proc entry for each
interrupt.

This affinity_hint handle for each interrupt can be used by underlying
drivers that need a better mechanism to control interrupt affinity.
The underlying driver can register a cpumask for the interrupt, which
will allow the driver to provide the CPU mask for the interrupt to
anything that requests it.  The intent is to extend the userspace
daemon, irqbalance, to help hint to it a preferred CPU mask to balance
the interrupt into.

[ tglx: Fixed compile warnings, added WARN_ON, made SMP only ]

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Cc: davem@davemloft.net
Cc: arjan@linux.jf.intel.com
Cc: bhutchings@solarflare.com
LKML-Reference: <20100430214445.3992.41647.stgit@ppwaskie-hc2.jf.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-03 11:50:57 +02:00
Steffen Klassert
6751fb3c0e padata: Use get_online_cpus/put_online_cpus
This patch puts get_online_cpus/put_online_cpus around the places
we modify the padata cpumask to ensure that no cpu goes offline
during this operation.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-05-03 11:32:12 +08:00
Steffen Klassert
7b389b2cc5 padata: Initialize the padata queues only for the used cpus
padata_alloc_pd set up queues for all possible cpus.
This patch changes this to set up the queues just for
the used cpus.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-05-03 11:32:11 +08:00
Steffen Klassert
7d0d2d385c padata: Remove superfluous might_sleep
might_sleep() was placed before mutex_lock() in some places.
We remove them because mutex_lock() does might_sleep() too.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-05-03 11:32:11 +08:00
Steffen Klassert
e2cb2f1c2c padata: cpu hotplug code should depend on CONFIG_HOTPLUG_CPU
This patch makes the padata cpu hotplug code dependend on CONFIG_HOTPLUG_CPU.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-05-03 11:32:11 +08:00
Herbert Xu
df2071bd08 Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 2010-05-03 11:28:58 +08:00
Steffen Klassert
97e3d94aac padata: Dont scale the parallel objects with the cpus
Scaling the maximum number of objects in the parallel
codepath can lead to out of memory problems on bigsmp
machines.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-05-03 11:16:13 +08:00
Tejun Heo
048c852051 perf: Fix resource leak in failure path of perf_event_open()
perf_event_open() kfrees event after init failure which doesn't
release all resources allocated by perf_event_alloc().  Use
free_event() instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: <stable@kernel.org>
LKML-Reference: <4BDBE237.1040809@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-01 13:11:25 +02:00
Frederic Weisbecker
feef47d0cb hw-breakpoints: Get the number of available registers on boot dynamically
The breakpoint generic layer assumes that archs always know in advance
the static number of address registers available to host breakpoints
through the HBP_NUM macro.

However this is not true for every archs. For example Arm needs to get
this information dynamically to handle the compatiblity between
different versions.

To solve this, this patch proposes to drop the static HBP_NUM macro
and let the arch provide the number of available slots through a
new hw_breakpoint_slots() function. For archs that have
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS selected, it will be called once
as the number of registers fits for instruction and data breakpoints
together.
For the others it will be called first to get the number of
instruction breakpoint registers and another time to get the
data breakpoint registers, the targeted type is given as a
parameter of hw_breakpoint_slots().

Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: K. Prasad <prasad@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Ingo Molnar <mingo@elte.hu>
2010-05-01 04:32:14 +02:00
Frederic Weisbecker
f93a205411 hw-breakpoints: Handle breakpoint weight in allocation constraints
Depending on their nature and on what an arch supports, breakpoints
may consume more than one address register. For example a simple
absolute address match usually only requires one address register.
But an address range match may consume two registers.

Currently our slot allocation constraints, that tend to reflect the
limited arch's resources, always consider that a breakpoint consumes
one slot.

Then provide a way for archs to tell us the weight of a breakpoint
through a new hw_breakpoint_weight() helper. This weight will be
computed against the generic allocation constraints instead of
a constant value.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: K. Prasad <prasad@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
2010-05-01 04:32:12 +02:00
Frederic Weisbecker
0102752e4c hw-breakpoints: Separate constraint space for data and instruction breakpoints
There are two outstanding fashions for archs to implement hardware
breakpoints.

The first is to separate breakpoint address pattern definition
space between data and instruction breakpoints. We then have
typically distinct instruction address breakpoint registers
and data address breakpoint registers, delivered with
separate control registers for data and instruction breakpoints
as well. This is the case of PowerPc and ARM for example.

The second consists in having merged breakpoint address space
definition between data and instruction breakpoint. Address
registers can host either instruction or data address and
the access mode for the breakpoint is defined in a control
register. This is the case of x86 and Super H.

This patch adds a new CONFIG_HAVE_MIXED_BREAKPOINTS_REGS config
that archs can select if they belong to the second case. Those
will have their slot allocation merged for instructions and
data breakpoints.

The others will have a separate slot tracking between data and
instruction breakpoints.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: K. Prasad <prasad@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
2010-05-01 04:32:11 +02:00
Frederic Weisbecker
b2812d031d hw-breakpoints: Change/Enforce some breakpoints policies
The current policies of breakpoints in x86 and SH are the following:

- task bound breakpoints can only break on userspace addresses
- cpu wide breakpoints can only break on kernel addresses

The former rule prevents ptrace breakpoints to be set to trigger on
kernel addresses, which is good. But as a side effect, we can't
breakpoint on kernel addresses for task bound breakpoints.

The latter rule simply makes no sense, there is no reason why we
can't set breakpoints on userspace while performing cpu bound
profiles.

We want the following new policies:

- task bound breakpoint can set userspace address breakpoints, with
no particular privilege required.
- task bound breakpoints can set kernelspace address breakpoints but
must be privileged to do that.
- cpu bound breakpoints can do what they want as they are privileged
already.

To implement these new policies, this patch checks if we are dealing
with a kernel address breakpoint, if so and if the exclude_kernel
parameter is set, we tell the user that the breakpoint is invalid,
which makes a good generic ptrace protection.
If we don't have exclude_kernel, ensure the user has the right
privileges as kernel breakpoints are quite sensitive (risk of
trap recursion attacks and global performance impacts).

[ Paul Mundt: keep addr space check for sh signal delivery and fix
  double function declaration]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: K. Prasad <prasad@linux.vnet.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2010-05-01 04:32:10 +02:00
Frederic Weisbecker
87e9b20246 hw-breakpoints: Check disabled breakpoints again
We stopped checking disabled breakpoints because we weren't
allowing breakpoints on NULL addresses. And gdb tends to set
NULL addresses on inactive breakpoints.

But refusing NULL addresses was actually a regression that has
been fixed now. There is no reason anymore to not validate
inactive breakpoint settings.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: K. Prasad <prasad@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Ingo Molnar <mingo@elte.hu>
2010-05-01 04:32:09 +02:00
Kei Tokunaga
bf81623542 [SCSI] add scsi trace core functions and put trace points
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com>
Signed-off-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
2010-04-30 12:51:10 -05:00
Kei Tokunaga
5a2e399595 [SCSI] ftrace: add __print_hex()
__print_hex() prints values in an array in hex (w/o '0x') (space separated)
EX) 92 33 32 f3 ee 4d

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com>
Signed-off-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
2010-04-30 12:50:22 -05:00
Frederic Weisbecker
913769f24e lockdep: Simplify debug atomic ops
Simplify debug_atomic_inc/dec by using this_cpu_inc/dec() instead
of doing it through an indirect get_cpu_var() and a manual
incrementation.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2010-04-30 19:15:56 +02:00
Frederic Weisbecker
8795d7717c lockdep: Fix redundant_hardirqs_on incremented with irqs enabled
When a path restore the flags while irqs are already enabled, we
update the per cpu var redundant_hardirqs_on in a racy fashion
and debug_atomic_inc() warns about this situation.

In this particular case, loosing a few hits in a stat is not a big
deal, so increment it without protection.

v2: Don't bother with disabling irq, we can miss one count in
    rare situations

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-04-30 19:15:49 +02:00
Frederic Weisbecker
868c522b1b Merge commit 'v2.6.34-rc6' into core/locking
Merge reason: Further lockdep patches depend on per cpu updates
made in -rc1.
2010-04-30 19:12:47 +02:00
Paul E. McKenney
8b46f88084 rcu: Fix RCU lockdep splat on freezer_fork path
Add an RCU read-side critical section to suppress this false
positive.

Located-by: Eric Paris <eparis@parisplace.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: eric.dumazet@gmail.com
LKML-Reference: <1271880131-3951-2-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-30 12:03:17 +02:00
Peter Zijlstra
8b08ca52f5 rcu: Fix RCU lockdep splat in set_task_cpu on fork path
Add an RCU read-side critical section to suppress this false
positive.

Located-by: Eric Paris <eparis@parisplace.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: eric.dumazet@gmail.com
LKML-Reference: <1271880131-3951-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-30 12:03:17 +02:00
Ingo Molnar
3ca50496c2 Merge commit 'v2.6.34-rc6' into perf/core
Merge reason: update to the latest -rc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-30 09:56:44 +02:00
Oleg Nesterov
4d707b9f48 workqueue: change cancel_work_sync() to clear work->data
In short: change cancel_work_sync(work) to mark this work as "never
queued" upon return.

When cancel_work_sync(work) succeeds, we know that this work can't be
queued or running, and since we own WORK_STRUCT_PENDING nobody can change
the bits in work->data under us. This means we can also clear the "cwq"
part along with _PENDING bit lockless before return, unless the work is
queued nobody can assume get_wq_data() is stable even under cwq->lock.

This change can speedup the subsequent cancel/flush requests, and as
Dmitry pointed out this simplifies the usage of work_struct's which
can be queued on different workqueues. Consider this pseudo code from
the input subsystem:

	struct workqueue_struct *WQ;
	struct work_struct *WORK;

	for (;;) {
		WQ = create_workqueue();
		...
		if (condition())
			queue_work(WQ, WORK);
		...
		cancel_work_sync(WORK);
		destroy_workqueue(WQ);
	}

If condition() returns T and then F, cancel_work_sync() will crash the
kernel because WORK->data still points to the already destroyed workqueue.
With this patch the code like above becomes correct.

Suggested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-04-30 08:57:25 +02:00
Alan Stern
eef6a7d5c2 workqueue: warn about flush_scheduled_work()
This patch (as1319) adds kerneldoc and a pointed warning to
flush_scheduled_work().

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-04-30 08:57:25 +02:00
Oleg Nesterov
47dd5be2d6 workqueue: flush_delayed_work: keep the original workqueue for re-queueing
flush_delayed_work() always uses keventd_wq for re-queueing,
but it should use the workqueue this dwork was queued on.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-04-30 07:24:51 +02:00
Jens Axboe
7407cf355f Merge branch 'master' into for-2.6.35
Conflicts:
	fs/block_dev.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-29 09:36:24 +02:00
Steven Rostedt
37e44bc50d tracing: Fix sleep time function profiling
When sleep_time is off the function profiler ignores the time that a task
is scheduled out. When the task is scheduled out a timestamp is taken.
When the task is scheduled back in, the timestamp is compared to the
current time and the saved calltimes are adjusted accordingly.

But when stopping the function profiler, the sched switch hook that
does this adjustment was stopped before shutting down the tracer.
This allowed some tasks to not get their timestamps set when they
scheduled out. When the function profiler started again, this would
skew the times of the scheduler functions.

This patch moves the stopping of the sched switch to after the function
profiler is stopped. It also ignores zero set calltimes, which may
happen on start up.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-27 21:04:24 -04:00
Chase Douglas
e330b3bcd8 tracing: Show sample std dev in function profiling
When combined with function graph tracing the ftrace function profiler
also prints the average run time of functions. While this gives us some
good information, it doesn't tell us anything about the variance of the
run times of the function. This change prints out the s^2 sample
standard deviation alongside the average.

This change adds one entry to the profile record structure. This
increases the memory footprint of the function profiler by 1/3 on a
32-bit system, and by 1/5 on a 64-bit system when function graphing is
enabled, though the memory is only allocated when the profiler is turned
on. During the profiling, one extra line of code adds the squared
calltime to the new record entry, so this should not adversly affect
performance.

Note that the square of the sample standard deviation is printed because
there is no sqrt implementation for unsigned long long in the kernel.

Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
LKML-Reference: <1272304925-2436-1-git-send-email-chase.douglas@canonical.com>

[ fixed comment about ns^2 -> us^2 conversion ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-27 18:23:15 -04:00
Steven Rostedt
a838b2e634 ring-buffer: Make benchmark handle missed events
With the addition of the "missed events" flags that is stored in the
commit field of the ring buffer page, the ring_buffer_benchmark
was not updated to handle this. If events are missed, then the
missed events flag is set in the ring buffer page, the benchmark
will count that flag as part of the size of the page and will hit the BUG()
when it tries to read beyond the page.

The solution is simply to have the ring buffer benchmark mask off
the extra bits.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-27 13:26:58 -04:00
David Miller
72c9ddfd4c ring-buffer: Make non-consuming read less expensive with lots of cpus.
When performing a non-consuming read, a synchronize_sched() is
performed once for every cpu which is actively tracing.

This is very expensive, and can make it take several seconds to open
up the 'trace' file with lots of cpus.

Only one synchronize_sched() call is actually necessary.  What is
desired is for all cpus to see the disabling state change.  So we
transform the existing sequence:

	for_each_cpu() {
		ring_buffer_read_start();
	}

where each ring_buffer_start() call performs a synchronize_sched(),
into the following:

	for_each_cpu() {
		ring_buffer_read_prepare();
	}
	ring_buffer_read_prepare_sync();
	for_each_cpu() {
		ring_buffer_read_start();
	}

wherein only the single ring_buffer_read_prepare_sync() call needs to
do the synchronize_sched().

The first phase, via ring_buffer_read_prepare(), allocates the 'iter'
memory and increments ->record_disabled.

In the second phase, ring_buffer_read_prepare_sync() makes sure this
->record_disabled state is visible fully to all cpus.

And in the final third phase, the ring_buffer_read_start() calls reset
the 'iter' objects allocated in the first phase since we now know that
none of the cpus are adding trace entries any more.

This makes openning the 'trace' file nearly instantaneous on a
sparc64 Niagara2 box with 128 cpus tracing.

Signed-off-by: David S. Miller <davem@davemloft.net>
LKML-Reference: <20100420.154711.11246950.davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-27 13:06:35 -04:00
Jiri Olsa
62b915f106 tracing: Add graph output support for irqsoff tracer
Add function graph output to irqsoff tracer.

The graph output is enabled by setting new 'display-graph' trace option.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1270227683-14631-4-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-27 12:36:53 -04:00
Alessio Igor Bogani
b8bc1389b7 ptrace: Cleanup useless header
BKL isn't present anymore into this file thus we can safely remove
smp_lock.h inclusion.

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-04-26 23:42:51 +02:00
Jiri Olsa
d7a8d9e907 tracing: Have graph flags passed in to ouput functions
Let the function graph tracer have custom flags passed to its
output functions.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1270227683-14631-3-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-26 17:30:18 -04:00
Jiri Olsa
9106b69382 tracing: Add ftrace events for graph tracer
Add ftrace events for graph tracer, so the graph output could be shared
with other tracers.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1270227683-14631-2-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-26 16:55:08 -04:00
Andreas Schwab
46da276648 kernel/sys.c: fix compat uname machine
On ppc64 you get this error:

  $ setarch ppc -R true
  setarch: ppc: Unrecognized architecture

because uname still reports ppc64 as the machine.

So mask off the personality flags when checking for PER_LINUX32.

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-24 11:31:24 -07:00
Robert Richter
b971f06187 Merge commit 'tip/tracing/core' into oprofile/core
Conflicts:
	drivers/oprofile/cpu_buffer.c

Signed-off-by: Robert Richter <robert.richter@amd.com>
2010-04-23 16:47:51 +02:00
Ingo Molnar
77a7f2e94e Merge branch 'tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into tracing/core 2010-04-23 11:25:31 +02:00
Ingo Molnar
70bce3ba77 Merge branch 'linus' into perf/core
Merge reason: merge the latest fixes, update to latest -rc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-23 11:10:30 +02:00
Suresh Siddha
99bd5e2f24 sched: Fix select_idle_sibling() logic in select_task_rq_fair()
Issues in the current select_idle_sibling() logic in select_task_rq_fair()
in the context of a task wake-up:

a) Once we select the idle sibling, we use that domain (spanning the cpu that
   the task is currently woken-up and the idle sibling that we found) in our
   wake_affine() decisions. This domain is completely different from the
   domain(we are supposed to use) that spans the cpu that the task currently
   woken-up and the cpu where the task previously ran.

b) We do select_idle_sibling() check only for the cpu that the task is
   currently woken-up on. If select_task_rq_fair() selects the previously run
   cpu for waking the task, doing a select_idle_sibling() check
   for that cpu also helps and we don't do this currently.

c) In the scenarios where the cpu that the task is woken-up is busy but
   with its HT siblings are idle, we are selecting the task be woken-up
   on the idle HT sibling instead of a core that it previously ran
   and currently completely idle. i.e., we are not taking decisions based on
   wake_affine() but directly selecting an idle sibling that can cause
   an imbalance at the SMT/MC level which will be later corrected by the
   periodic load balancer.

Fix this by first going through the load imbalance calculations using
wake_affine() and once we make a decision of woken-up cpu vs previously-ran cpu,
then choose a possible idle sibling for waking up the task on.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1270079265.7835.8.camel@sbs-t61.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-23 11:02:02 +02:00
Peter Zijlstra
669c55e9f9 sched: Pre-compute cpumask_weight(sched_domain_span(sd))
Dave reported that his large SPARC machines spend lots of time in
hweight64(), try and optimize some of those needless cpumask_weight()
invocations (esp. with the large offstack cpumasks these are very
expensive indeed).

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-23 11:02:02 +02:00
Peter Zijlstra
74f5187ac8 sched: Cure load average vs NO_HZ woes
Chase reported that due to us decrementing calc_load_task prematurely
(before the next LOAD_FREQ sample), the load average could be scewed
by as much as the number of CPUs in the machine.

This patch, based on Chase's patch, cures the problem by keeping the
delta of the CPU going into NO_HZ idle separately and folding that in
on the next LOAD_FREQ update.

This restores the balance and we get strict LOAD_FREQ period samples.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Chase Douglas <chase.douglas@canonical.com>
LKML-Reference: <1271934490.1776.343.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-23 11:02:02 +02:00
Benjamin Herrenschmidt
4b40221048 mutex: Don't spin when the owner CPU is offline or other weird cases
Due to recent load-balancer changes that delay the task migration to
the next wakeup, the adaptive mutex spinning ends up in a live lock
when the owner's CPU gets offlined because the cpu_online() check
lives before the owner running check.

This patch changes mutex_spin_on_owner() to return 0 (don't spin) in
any case where we aren't sure about the owner struct validity or CPU
number, and if the said CPU is offline. There is no point going back &
re-evaluate spinning in corner cases like that, let's just go to
sleep.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1271212509.13059.135.camel@pasglop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-23 11:00:28 +02:00
Jiri Kosina
6c9468e9eb Merge branch 'master' into for-next 2010-04-23 02:08:44 +02:00
David Howells
e134d200d5 CRED: Fix a race in creds_are_invalid() in credentials debugging
creds_are_invalid() reads both cred->usage and cred->subscribers and then
compares them to make sure the number of processes subscribed to a cred struct
never exceeds the refcount of that cred struct.

The problem is that this can cause a race with both copy_creds() and
exit_creds() as the two counters, whilst they are of atomic_t type, are only
atomic with respect to themselves, and not atomic with respect to each other.

This means that if creds_are_invalid() can read the values on one CPU whilst
they're being modified on another CPU, and so can observe an evolving state in
which the subscribers count now is greater than the usage count a moment
before.

Switching the order in which the counts are read cannot help, so the thing to
do is to remove that particular check.

I had considered rechecking the values to see if they're in flux if the test
fails, but I can't guarantee they won't appear the same, even if they've
changed several times in the meantime.

Note that this can only happen if CONFIG_DEBUG_CREDENTIALS is enabled.

The problem is only likely to occur with multithreaded programs, and can be
tested by the tst-eintr1 program from glibc's "make check".  The symptoms look
like:

	CRED: Invalid credentials
	CRED: At include/linux/cred.h:240
	CRED: Specified credentials: ffff88003dda5878 [real][eff]
	CRED: ->magic=43736564, put_addr=(null)
	CRED: ->usage=766, subscr=766
	CRED: ->*uid = { 0,0,0,0 }
	CRED: ->*gid = { 0,0,0,0 }
	CRED: ->security is ffff88003d72f538
	CRED: ->security {359, 359}
	------------[ cut here ]------------
	kernel BUG at kernel/cred.c:850!
	...
	RIP: 0010:[<ffffffff81049889>]  [<ffffffff81049889>] __invalid_creds+0x4e/0x52
	...
	Call Trace:
	 [<ffffffff8104a37b>] copy_creds+0x6b/0x23f

Note the ->usage=766 and subscr=766.  The values appear the same because
they've been re-read since the check was made.

Reported-by: Roland McGrath <roland@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-04-22 09:14:29 +10:00
Frederic Weisbecker
cecbca96da tracing: Dump either the oops's cpu source or all cpus buffers
The ftrace_dump_on_oops kernel parameter, sysctl and sysrq let one
dump every cpu buffers when an oops or panic happens.

It's nice when you have few cpus but it may take ages if have many,
plus you miss the real origin of the problem in all the cpu traces.

Sometimes, all you need is to dump the cpu buffer that triggered the
opps, most of the time it is our main interest.

This patch modifies ftrace_dump_on_oops to handle this choice.

The ftrace_dump_on_oops kernel parameter, when it comes alone, has
the same behaviour than before. But ftrace_dump_on_oops=orig_cpu
will only dump the buffer of the cpu that oops'ed.

Similarly, sysctl kernel.ftrace_dump_on_oops=1 and
echo 1 > /proc/sys/kernel/ftrace_dump_on_oops keep their previous
behaviour. But setting 2 jumps into cpu origin dump mode.

v2: Fix double setup
v3: Fix spelling issues reported by Randy Dunlap
v4: Also update __ftrace_dump in the selftests

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
2010-04-21 23:11:42 +02:00
Ingo Molnar
ac0053fd51 Merge commit 'v2.6.34-rc5' into tracing/core
Merge reason: pick up latest -rc's.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-21 09:47:05 +02:00
David Howells
eff30363c0 CRED: Fix double free in prepare_usermodehelper_creds() error handling
Patch 570b8fb505:

	Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
	Date:   Tue Mar 30 00:04:00 2010 +0100
	Subject: CRED: Fix memory leak in error handling

attempts to fix a memory leak in the error handling by making the offending
return statement into a jump down to the bottom of the function where a
kfree(tgcred) is inserted.

This is, however, incorrect, as it does a kfree() after doing put_cred() if
security_prepare_creds() fails.  That will result in a double free if 'error'
is jumped to as put_cred() will also attempt to free the new tgcred record by
virtue of it being pointed to by the new cred record.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-04-21 09:20:35 +10:00
Zhang, Yanmin
39447b386c perf: Enhance perf to allow for guest statistic collection from host
Below patch introduces perf_guest_info_callbacks and related
register/unregister functions. Add more PERF_RECORD_MISC_XXX bits
meaning guest kernel and guest user space.

Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-04-19 12:35:33 +03:00
Paul E. McKenney
bc293d62b2 rcu: Make RCU lockdep check the lockdep_recursion variable
The lockdep facility temporarily disables lockdep checking by
incrementing the current->lockdep_recursion variable.  Such
disabling happens in NMIs and in other situations where lockdep
might expect to recurse on itself.

This patch therefore checks current->lockdep_recursion, disabling RCU
lockdep splats when this variable is non-zero.  In addition, this patch
removes the "likely()", as suggested by Lai Jiangshan.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: David Miller <davem@davemloft.net>
Tested-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: eric.dumazet@gmail.com
LKML-Reference: <20100415195039.GA22623@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-19 08:37:19 +02:00
Mike Galbraith
09a40af524 sched: Fix UP update_avg() build warning
update_avg() is only used for SMP builds, move it to the nearest
SMP block.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1271309399.14779.17.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-15 09:36:47 +02:00
Ingo Molnar
b257c14ceb Merge branch 'linus' into sched/core
Merge reason: merge the latest fixes, update to -rc4.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-15 09:36:16 +02:00
Ingo Molnar
b5a80b7e91 Merge branch 'perf' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 into perf/core 2010-04-15 09:16:51 +02:00
Divyesh Shah
b6ac23af2c blkio: fix for modular blk-cgroup build
After merging the block tree, 20100414's linux-next build (x86_64
allmodconfig) failed like this:

ERROR: "get_gendisk" [block/blk-cgroup.ko] undefined!
ERROR: "sched_clock" [block/blk-cgroup.ko] undefined!

This happens because the two symbols aren't exported and hence not available
when blk-cgroup code is built as a module. I've tried to stay consistent with
the use of EXPORT_SYMBOL or EXPORT_SYMBOL_GPL with the other symbols in the
respective files.

Signed-off-by: Divyesh Shah <dpshah@google.com>
Acked-by: Gui Jianfeng <guijianfeng@cn.fujitsu.cn>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-15 08:54:59 +02:00
Frederic Weisbecker
95476b64ab perf: Fix hlist related build error
hlist helpers need to be available for all software events, not
only trace events.

Pull them out outside the ifdef CONFIG_EVENT_TRACING section.

Fixes:
	kernel/perf_event.c:4573: error: implicit declaration of function 'swevent_hlist_put'
	kernel/perf_event.c:4614: error: implicit declaration of function 'swevent_hlist_get'
	kernel/perf_event.c:5534: error: implicit declaration of function 'swevent_hlist_release

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1271281338-23491-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-15 01:34:46 +02:00
Masami Hiramatsu
93ccae7a22 tracing/kprobes: Support basic types on dynamic events
Support basic types of integer (u8, u16, u32, u64, s8, s16, s32, s64) in
kprobe tracer. With this patch, users can specify above basic types on
each arguments after ':'. If omitted, the argument type is set as
unsigned long (u32 or u64, arch-dependent).

 e.g.
  echo 'p account_system_time+0 hardirq_offset=%si:s32' > kprobe_events

  adds a probe recording hardirq_offset in signed-32bits value on the
  entry of account_system_time.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100412171708.3790.18599.stgit@localhost6.localdomain6>
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2010-04-14 17:26:28 -03:00
Frederic Weisbecker
df8290bf7e perf: Make clock software events consistent with general exclusion rules
The cpu/task clock events implement their own version of exclusion
on top of exclude_user and exclude_kernel.

The result is that when the event triggered in the kernel but we
have exclude_kernel set, we try to rewind using task_pt_regs.
There are two side effects of this:

- we call task_pt_regs even on kernel threads, which doesn't give
  us the desired result.
- if the event occured in the kernel, we shouldn't rewind to the
  user context. We want to actually ignore the event.

get_irq_regs() will always give us the right interrupted context, so
use its result and submit it to perf_exclude_context() that knows
when an event must be ignored.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
2010-04-14 18:20:33 +02:00
Frederic Weisbecker
76e1d9047e perf: Store active software events in a hashlist
Each time a software event triggers, we need to walk through
the entire list of events from the current cpu and task contexts
to retrieve a running perf event that matches.
We also need to check a matching perf event is actually counting.

This walk is wasteful and makes the event fast path scaling
down with a growing number of events running on the same
contexts.

To solve this, we store the running perf events in a hashlist to
get an immediate access to them against their type:event_id when
they trigger.

v2: - Fix SWEVENT_HLIST_SIZE definition (and re-learn some basic
      maths along the way)
    - Only allocate hlist for online cpus, but keep track of the
      refcount on offline possible cpus too, so that we allocate it
      if needed when it becomes online.
    - Drop the kref use as it's not adapted to our tricks anymore.

v3: - Fix bad refcount check (address instead of value). Thanks to
      Eric Dumazet who spotted this.
    - While exiting cpu, move the hlist release out of the IPI path
      to lock the hlist mutex sanely.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
2010-04-14 18:20:33 +02:00
Ingo Molnar
b15c7b1cee Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/core 2010-04-14 12:15:23 +02:00
Dmitry Torokhov
97f5f0cd8c Input: implement SysRq as a separate input handler
Instead of keeping SysRq support inside of legacy keyboard driver split
it out into a separate input handler (filter). This stops most SysRq input
events from leaking into evdev clients (some events, such as first SysRq
scancode - not keycode - event, are still leaked into both legacy keyboard
and evdev).

[martinez.javier@gmail.com: fix compile error when CONFIG_MAGIC_SYSRQ is
 not defined]
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
2010-04-13 23:26:02 -07:00
Thomas Gleixner
6932bf37be genirq: Remove IRQF_DISABLED from core code
Remove all code which is related to IRQF_DISABLED from the core kernel
code. IRQF_DISABLED still exists as a flag, but becomes a NOOP and
will be removed after a grace period. That way we can easily revert to
the previous behaviour by just restoring the core code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Linus Torvalds <torvalds@osdl.org>
LKML-Reference: <20100326000405.991244690@linutronix.de>
2010-04-13 16:36:40 +02:00
Ingo Molnar
e58aa3d2d0 genirq: Run irq handlers with interrupts disabled
Running interrupt handlers with interrupts enabled can cause stack
overflows. That has been observed with multiqueue NICs delivering all
their interrupts to a single core. We might band aid that somehow by
checking the interrupt stacks, but the real safe fix is to run the irq
handlers with interrupts disabled.

Drivers for whacky hardware still can reenable them in the handler
itself, if the need arises. (They do already due to lockdep)

The risk of doing this is rather low:

 - lockdep already enforces this
 - CONFIG_NOHZ has shaken out the drivers which relied on jiffies updates
 - time keeping is not longer sensitive to the timer interrupt being delayed

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Linus Torvalds <torvalds@osdl.org>
LKML-Reference: <20100326000405.758579387@linutronix.de>
2010-04-13 16:36:40 +02:00
Marc Zyngier
ae731f8d07 genirq: Introduce request_any_context_irq()
Now that we enjoy threaded interrupts, we're starting to see irq_chip
implementations (wm831x, pca953x) that make use of threaded interrupts
for the controller, and nested interrupts for the client interrupt. It
all works very well, with one drawback:

Drivers requesting an IRQ must now know whether the handler will
run in a thread context or not, and call request_threaded_irq() or
request_irq() accordingly.

The problem is that the requesting driver sometimes doesn't know
about the nature of the interrupt, specially when the interrupt
controller is a discrete chip (typically a GPIO expander connected
over I2C) that can be connected to a wide variety of otherwise perfectly
supported hardware.

This patch introduces the request_any_context_irq() function that mostly
mimics the usual request_irq(), except that it checks whether the irq
level is configured as nested or not, and calls the right backend.
On success, it also returns either IRQC_IS_HARDIRQ or IRQC_IS_NESTED.

[ tglx: Made return value an enum, simplified code and made the export
  	of request_any_context_irq GPL ]

Signed-off-by: Marc Zyngier <maz@misterjones.org>
Cc: <joachim.eastwood@jotron.com>
LKML-Reference: <927ea285bd0c68934ddae1a47e44a9ba@localhost>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-13 16:36:39 +02:00
Thomas Gleixner
7c7145f6ac Merge branch 'linus' into irq/core
Reason: Get the upstream IRQF_DISABLED related changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-13 14:12:17 +02:00
John Stultz
6a867a3955 time: Remove xtime_cache
With the earlier logarithmic time accumulation patch, xtime will now
always be within one "tick" of the current time, instead of possibly
half a second off.

This removes the need for the xtime_cache value, which always stored the
time at the last interrupt, so this patch cleans that up removing the
xtime_cache related code.

This patch also addresses an issue with an earlier version of this change,
where xtime_cache was normalizing xtime, which could in some cases be
not valid (ie: tv_nsec == NSEC_PER_SEC). This is fixed by handling
the edge case in update_wall_time().

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Petr Titěra <P.Titera@century.cz>
LKML-Reference: <1270589451-30773-1-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-13 12:43:42 +02:00
Eric Paris
05b90496f2 security: remove dead hook acct
Unused hook.  Remove.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-04-12 12:19:19 +10:00
Eric Paris
6307f8fee2 security: remove dead hook task_setgroups
Unused hook.  Remove.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-04-12 12:19:18 +10:00
Eric Paris
06ad187e28 security: remove dead hook task_setgid
Unused hook.  Remove.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-04-12 12:19:17 +10:00
Eric Paris
43ed8c3b45 security: remove dead hook task_setuid
Unused hook.  Remove.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-04-12 12:19:16 +10:00
Eric Paris
0968d0060a security: remove dead hook cred_commit
Unused hook.  Remove.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-04-12 12:19:15 +10:00
Jiri Slaby
d88d4050dc PM / Hibernate: user.c, fix SNAPSHOT_SET_SWAP_AREA handling
When CONFIG_DEBUG_BLOCK_EXT_DEVT is set we decode the device
improperly by old_decode_dev and it results in an error while
hibernating with s2disk.

All users already pass the new device number, so switch to
new_decode_dev().

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: "Rafael J. Wysocki" <rjw@sisk.pl>
2010-04-10 22:28:56 +02:00
Arnd Bergmann
5534ecb2dd ptrace: kill BKL in ptrace syscall
The comment suggests that this usage is stale. There is no bkl in the
exec path so if there is a race lurking there, the bkl in ptrace is
not going to help in this regard.

Overview of the possibility of "accidental" races this bkl might
protect:

- ptrace_traceme() is protected against task removal and concurrent
read/write on current->ptrace as it locks write tasklist_lock.

- arch_ptrace_attach() is serialized by ptrace_traceme() against
concurrent PTRACE_TRACEME or PTRACE_ATTACH

- ptrace_attach() is protected the same way ptrace_traceme() and
in turn serializes arch_ptrace_attach()

- ptrace_check_attach() does its own well described serializing too.

There is no obvious race here.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roland McGrath <roland@redhat.com>
2010-04-10 15:34:21 +02:00
Linus Torvalds
2aedd192f7 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Fix sched_getaffinity()
2010-04-08 08:37:05 -07:00
Ingo Molnar
ca7e0c6120 Merge branch 'linus' into perf/core
Semantic conflict: arch/x86/kernel/cpu/perf_event_intel_ds.c

Merge reason: pick up latest fixes, fix the conflict

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-08 13:37:18 +02:00
Ingo Molnar
c1ab9cab75 Merge branch 'linus' into tracing/core
Conflicts:
	include/linux/module.h
	kernel/module.c

Semantic conflict:
	include/trace/events/module.h

Merge reason: Resolve the conflict with upstream commit 5fbfb18 ("Fix up
              possibly racy module refcounting")

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-08 10:18:47 +02:00
KAMEZAWA Hiroyuki
a3a2e76c77 mm: avoid null-pointer deref in sync_mm_rss()
- We weren't zeroing p->rss_stat[] at fork()

- Consequently sync_mm_rss() was dereferencing tsk->mm for kernel
  threads and was oopsing.

- Make __sync_task_rss_stat() static, too.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648

[akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)]
Reported-by: Troels Liebe Bentsen <tlb@rapanden.dk>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
"Michael S. Tsirkin" <mst@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-07 08:38:02 -07:00
Linus Torvalds
94c4fcec01 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: Force MSI irq handlers to run with interrupts disabled
2010-04-06 13:03:22 -07:00
Carsten Emde
351b3f7a21 hrtimers: Provide schedule_hrtimeout for CLOCK_REALTIME
The current version of schedule_hrtimeout() always uses the
monotonic clock. Some system calls such as mq_timedsend()
and mq_timedreceive(), however, require the use of the wall
clock due to the definition of the system call.

This patch provides the infrastructure to use schedule_hrtimeout() 
with a CLOCK_REALTIME timer.

Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Tested-by: Pradyumna Sampath <pradysam@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Veen <arjan@infradead.org>
LKML-Reference: <20100402204331.167439615@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-06 21:50:03 +02:00
Arjan van de Ven
3bbb9ec946 timers: Introduce the concept of timer slack for legacy timers
While HR timers have had the concept of timer slack for quite some time
now, the legacy timers lacked this concept, and had to make do with
round_jiffies() and friends.

Timer slack is important for power management; grouping timers reduces the
number of wakeups which in turn reduces power consumption.

This patch introduces timer slack to the legacy timers using the following
pieces:
* A slack field in the timer struct
* An api (set_timer_slack) that callers can use to set explicit timer slack
* A default slack of 0.4% of the requested delay for callers that do not set
  any explicit slack
* Rounding code that is part of mod_timer() that tries to
  group timers around jiffies values every 'power of two'
  (so quick timers will group around every 2, but longer timers
  will group around every 4, 8, 16, 32 etc)

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: johnstul@us.ibm.com
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-04-06 21:50:02 +02:00
Anton Blanchard
84fba5ec91 sched: Fix sched_getaffinity()
taskset on 2.6.34-rc3 fails on one of my ppc64 test boxes with
the following error:

  sched_getaffinity(0, 16, 0x10029650030) = -1 EINVAL (Invalid argument)

This box has 128 threads and 16 bytes is enough to cover it.

Commit cd3d8031eb (sched:
sched_getaffinity(): Allow less than NR_CPUS length) is
comparing this 16 bytes agains nr_cpu_ids.

Fix it by comparing nr_cpu_ids to the number of bits in the
cpumask we pass in.

Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Sharyathi Nagesh <sharyath@in.ibm.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Mike Travis <travis@sgi.com>
LKML-Reference: <20100406070218.GM5594@kryten>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-06 10:01:35 +02:00
Nick Piggin
5fbfb18d7a Fix up possibly racy module refcounting
Module refcounting is implemented with a per-cpu counter for speed.
However there is a race when tallying the counter where a reference may
be taken by one CPU and released by another.  Reference count summation
may then see the decrement without having seen the previous increment,
leading to lower than expected count.  A module which never has its
actual reference drop below 1 may return a reference count of 0 due to
this race.

Module removal generally runs under stop_machine, which prevents this
race causing bugs due to removal of in-use modules.  However there are
other real bugs in module.c code and driver code (module_refcount is
exported) where the callers do not run under stop_machine.

Fix this by maintaining running per-cpu counters for the number of
module refcount increments and the number of refcount decrements.  The
increments are tallied after the decrements, so any decrement seen will
always have its corresponding increment counted.  The final refcount is
the difference of the total increments and decrements, preventing a
low-refcount from being returned.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-05 19:50:02 -07:00
Frederic Weisbecker
bd6d29c25b lockstat: Make lockstat counting per cpu
Locking statistics are implemented using global atomic
variables. This is usually fine unless some path write them very
often.

This is the case for the function and function graph tracers
that disable irqs for each entry saved (except if the function
tracer is in preempt disabled only mode).
And calls to local_irq_save/restore() increment
hardirqs_on_events and hardirqs_off_events stats (or similar
stats for redundant versions).

Incrementing these global vars for each function ends up in too
much cache bouncing if lockstats are enabled.

To solve this, implement the debug_atomic_*() operations using
per cpu vars.

 -v2: Use per_cpu() instead of get_cpu_var() to fetch the desired
      cpu vars on debug_atomic_read()

 -v3: Store the stats in a structure. No need for local_t as we
      are NMI/irq safe.

 -v4: Fix tons of build errors. I thought I had tested it but I
      probably forgot to select the relevant config.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1270505417-8144-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
2010-04-06 00:15:37 +02:00
Eric Paris
449cedf099 audit: preface audit printk with audit
There have been a number of reports of people seeing the message:
"name_count maxed, losing inode data: dev=00:05, inode=3185"
in dmesg.  These usually lead to people reporting problems to the filesystem
group who are in turn clueless what they mean.

Eventually someone finds me and I explain what is going on and that
these come from the audit system.  The basics of the problem is that the
audit subsystem never expects a single syscall to 'interact' (for some
wish washy meaning of interact) with more than 20 inodes.  But in fact
some operations like loading kernel modules can cause changes to lots of
inodes in debugfs.

There are a couple real fixes being bandied about including removing the
fixed compile time limit of 20 or not auditing changes in debugfs (or
both) but neither are small and obvious so I am not sending them for
immediate inclusion (I hope Al forwards a real solution next devel
window).

In the meantime this patch simply adds 'audit' to the beginning of the
crap message so if a user sees it, they come blame me first and we can
talk about what it means and make sure we understand all of the reasons
it can happen and make sure this gets solved correctly in the long run.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-05 13:19:45 -07:00
Linus Torvalds
b66696e3c0 Merge branch 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc
* 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc:
  eeepc-wmi: include slab.h
  staging/otus: include slab.h from usbdrv.h
  percpu: don't implicitly include slab.h from percpu.h
  kmemcheck: Fix build errors due to missing slab.h
  include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
  iwlwifi: don't include iwl-dev.h from iwl-devtrace.h
  x86: don't include slab.h from arch/x86/include/asm/pgtable_32.h

Fix up trivial conflicts in include/linux/percpu.h due to
is_kernel_percpu_address() having been introduced since the slab.h
cleanup with the percpu_up.c splitup.
2010-04-05 09:39:11 -07:00
Linus Torvalds
9e74e7c81a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  module: add stub for is_module_percpu_address
  percpu, module: implement and use is_kernel/module_percpu_address()
  module: encapsulate percpu handling better and record percpu_size
2010-04-05 09:16:37 -07:00
Lai Jiangshan
aa27497c2f tracing: Fix uninitialized variable of tracing/trace output
Because a local variable is not initialized, I got these
when I did 'cat tracing/trace'. (not trace_pipe):

CPU:0 [LOST 18446744071579453134 EVENTS]
              ps-3099  [000]   560.770221: lock_acquire: ffff880030865010 &(&dentry->d_lock)->rlock
CPU:0 [LOST 18446744071579453134 EVENTS]
              ps-3099  [000]   560.770221: lock_release: ffff880030865010 &(&dentry->d_lock)->rlock
CPU:0 [LOST 18446612133255294080 EVENTS]
              ps-3099  [000]   560.770221: lock_acquire: ffff880030865010 &(&dentry->d_lock)->rlock
CPU:0 [LOST 18446744071579453134 EVENTS]
              ps-3099  [000]   560.770222: lock_release: ffff880030865010 &(&dentry->d_lock)->rlock
CPU:0 [LOST 18446744071579453134 EVENTS]
              ps-3099  [000]   560.770222: lock_release: ffffffff816cfb98 dcache_lock

See peek_next_entry(), it does not set *lost_events when we 'cat tracing/trace'

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4BB9A929.2000303@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-05 11:01:22 -04:00
Tejun Heo
336f5899d2 Merge branch 'master' into export-slabh 2010-04-05 11:37:28 +09:00
Linus Torvalds
8ce42c8b7f Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf: Always build the powerpc perf_arch_fetch_caller_regs version
  perf: Always build the stub perf_arch_fetch_caller_regs version
  perf, probe-finder: Build fix on Debian
  perf/scripts: Tuple was set from long in both branches in python_process_event()
  perf: Fix 'perf sched record' deadlock
  perf, x86: Fix callgraphs of 32-bit processes on 64-bit kernels
  perf, x86: Fix AMD hotplug & constraint initialization
  x86: Move notify_cpu_starting() callback to a later stage
  x86,kgdb: Always initialize the hw breakpoint attribute
  perf: Use hot regs with software sched switch/migrate events
  perf: Correctly align perf event tracing buffer
2010-04-04 12:13:10 -07:00
Linus Torvalds
0121b0c771 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: set_cpus_allowed_ptr(): Don't use rq->migration_thread after unlock
  sched: Fix proc_sched_set_task()
2010-04-04 12:12:31 -07:00
Linus Torvalds
a8941b0ed0 Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  ring-buffer: Add missing unlock
  tracing: Fix lockdep warning in global_clock()
2010-04-04 12:12:19 -07:00
Arnd Bergmann
3326c1ceee perf_event: Make perf fd non seekable
Perf_event does not need seeking, so prevent it in order to
get rid of default_llseek, which uses the BKL.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
[drop the nonseekable_open, not needed for anon inodes]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-04-04 15:27:58 +02:00
Ingo Molnar
22a4e4c435 Merge branch 'perf/urgent' into perf/core
Conflicts:
	tools/perf/Makefile

Merge reason: resolve the conflict.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-03 18:17:55 +02:00
Frederic Weisbecker
26d80aa782 perf: Always build the stub perf_arch_fetch_caller_regs version
Now that software events use perf_arch_fetch_caller_regs() too, we
need the stub version to be always built in for archs that don't
implement it.

Fixes the following build error in PARISC:

	kernel/built-in.o: In function `perf_event_task_sched_out':
	(.text.perf_event_task_sched_out+0x54): undefined reference to `perf_arch_fetch_caller_regs'

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
2010-04-03 12:22:05 +02:00
Linus Torvalds
5e123e5d9b Merge branch 'kgdb-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb
* 'kgdb-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
  kgdb: Turn off tracing while in the debugger
  kgdb: use atomic_inc and atomic_dec instead of atomic_set
  kgdb: eliminate kgdb_wait(), all cpus enter the same way
  kgdbts,sh: Add in breakpoint pc offset for superh
  kgdb: have ebin2mem call probe_kernel_write once
2010-04-02 19:45:05 -07:00
Linus Torvalds
24b99d1576 Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  Freezer: Fix buggy resume test for tasks frozen with cgroup freezer
  Freezer: Only show the state of tasks refusing to freeze
2010-04-02 19:44:42 -07:00
Jason Wessel
4da75b9cea kgdb: Turn off tracing while in the debugger
The kernel debugger should turn off kernel tracing any time the
debugger is active and restore it on resume.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
2010-04-02 14:58:19 -05:00
Jason Wessel
ae6bf53e02 kgdb: use atomic_inc and atomic_dec instead of atomic_set
Memory barriers should be used for the kgdb cpu synchronization.  The
atomic_set() does not imply a memory barrier.

Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2010-04-02 14:58:18 -05:00
Jason Wessel
62fae31219 kgdb: eliminate kgdb_wait(), all cpus enter the same way
This is a kgdb architectural change to have all the cpus (master or
slave) enter the same function.

A cpu that hits an exception (wants to be the master cpu) will call
kgdb_handle_exception() from the trap handler and then invoke a
kgdb_roundup_cpu() to synchronize the other cpus and bring them into
the kgdb_handle_exception() as well.

A slave cpu will enter kgdb_handle_exception() from the
kgdb_nmicallback() and set the exception state to note that the
processor is a slave.

Previously the salve cpu would have called kgdb_wait().  This change
allows the debug core to change cpus without resuming the system in
order to inspect arch specific cpu information.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2010-04-02 14:58:18 -05:00
Jason Wessel
a0279bd580 kgdb: have ebin2mem call probe_kernel_write once
Rather than call probe_kernel_write() one byte at a time, process the
whole buffer locally and pass the entire result in one go.  This way,
architectures that need to do special handling based on the length can
do so, or we only end up calling memcpy() once.

[sonic.zhang@analog.com: Reported original problem and preliminary patch]

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
2010-04-02 14:58:17 -05:00
Peter Zijlstra
371fd7e7a5 sched: Add enqueue/dequeue flags
In order to reduce the dependency on TASK_WAKING rework the enqueue
interface to support a proper flags field.

Replace the int wakeup, bool head arguments with an int flags argument
and create the following flags:

  ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
  ENQUEUE_WAKING - the enqueue has relative vruntime due to
                   having sched_class::task_waking() called,
  ENQUEUE_HEAD - the waking task should be places on the head
                 of the priority queue (where appropriate).

For symmetry also convert sched_class::dequeue() to a flags scheme.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:12:05 +02:00
Peter Zijlstra
cc87f76a60 sched: Fix nr_uninterruptible count
The cpuload calculation in calc_load_account_active() assumes
rq->nr_uninterruptible will not change on an offline cpu after
migrate_nr_uninterruptible(). However the recent migrate on wakeup
changes broke that and would result in decrementing the offline cpu's
rq->nr_uninterruptible.

Fix this by accounting the nr_uninterruptible on the waking cpu.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:12:04 +02:00
Peter Zijlstra
65cc8e4859 sched: Optimize task_rq_lock()
Now that we hold the rq->lock over set_task_cpu() again, we can do
away with most of the TASK_WAKING checks and reduce them again to
set_cpus_allowed_ptr().

Removes some conditionals from scheduling hot-paths.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:12:04 +02:00
Peter Zijlstra
0017d73509 sched: Fix TASK_WAKING vs fork deadlock
Oleg noticed a few races with the TASK_WAKING usage on fork.

 - since TASK_WAKING is basically a spinlock, it should be IRQ safe
 - since we set TASK_WAKING (*) without holding rq->lock it could
   be there still is a rq->lock holder, thereby not actually
   providing full serialization.

(*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.

Cure the second issue by not setting TASK_WAKING in sched_fork(), but
only temporarily in wake_up_new_task() while calling select_task_rq().

Cure the first by holding rq->lock around the select_task_rq() call,
this will disable IRQs, this however requires that we push down the
rq->lock release into select_task_rq_fair()'s cgroup stuff.

Because select_task_rq_fair() still needs to drop the rq->lock we
cannot fully get rid of TASK_WAKING.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:12:03 +02:00
Oleg Nesterov
9084bb8246 sched: Make select_fallback_rq() cpuset friendly
Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
with select_fallback_rq(). It can be called from any context and can't use
any cpuset locks including task_lock(). It is called when the task doesn't
have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
suitable cpu.

I am not proud of this patch. Everything which needs such a fat comment
can't be good even if correct. But I'd prefer to not change the locking
rules in the code I hardly understand, and in any case I believe this
simple change make the code much more correct compared to deadlocks we
currently have.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100315091027.GA9155@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:12:03 +02:00
Oleg Nesterov
6a1bdc1b57 sched: _cpu_down(): Don't play with current->cpus_allowed
_cpu_down() changes the current task's affinity and then recovers it at
the end. The problems are well known: we can't restore old_allowed if it
was bound to the now-dead-cpu, and we can race with the userspace which
can change cpu-affinity during unplug.

_cpu_down() should not play with current->cpus_allowed at all. Instead,
take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable()
removes the dying cpu from cpu_online_mask.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100315091023.GA9148@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:12:03 +02:00
Oleg Nesterov
30da688ef6 sched: sched_exec(): Remove the select_fallback_rq() logic
sched_exec()->select_task_rq() reads/updates ->cpus_allowed lockless.
This can race with other CPUs updating our ->cpus_allowed, and this
looks meaningless to me.

The task is current and running, it must have online cpus in ->cpus_allowed,
the fallback mode is bogus. And, if ->sched_class returns the "wrong" cpu,
this likely means we raced with set_cpus_allowed() which was called
for reason, why should sched_exec() retry and call ->select_task_rq()
again?

Change the code to call sched_class->select_task_rq() directly and do
nothing if the returned cpu is wrong after re-checking under rq->lock.

From now task_struct->cpus_allowed is always stable under TASK_WAKING,
select_fallback_rq() is always called under rq-lock or the caller or
the caller owns TASK_WAKING (select_task_rq).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100315091019.GA9141@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:12:02 +02:00
Oleg Nesterov
c1804d547d sched: move_task_off_dead_cpu(): Remove retry logic
The previous patch preserved the retry logic, but it looks unneeded.

__migrate_task() can only fail if we raced with migration after we dropped
the lock, but in this case the caller of set_cpus_allowed/etc must initiate
migration itself if ->on_rq == T.

We already fixed p->cpus_allowed, the changes in active/online masks must
be visible to racer, it should migrate the task to online cpu correctly.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100315091014.GA9138@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:12:02 +02:00
Oleg Nesterov
1445c08d06 sched: move_task_off_dead_cpu(): Take rq->lock around select_fallback_rq()
move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed
lockless. We can race with set_cpus_allowed() running in parallel.

Change it to take rq->lock around select_fallback_rq(). Note that it is not
trivial to move this spin_lock() into select_fallback_rq(), we must recheck
the task was not migrated after we take the lock and other callers do not
need this lock.

To avoid the races with other callers of select_fallback_rq() which rely on
TASK_WAKING, we also check p->state != TASK_WAKING and do nothing otherwise.
The owner of TASK_WAKING must update ->cpus_allowed and choose the correct
CPU anyway, and the subsequent __migrate_task() is just meaningless because
p->se.on_rq must be false.

Alternatively, we could change select_task_rq() to take rq->lock right
after it calls sched_class->select_task_rq(), but this looks a bit ugly.

Also, change it to not assume irqs are disabled and absorb __migrate_task_irq().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100315091010.GA9131@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:12:01 +02:00
Oleg Nesterov
897f0b3c3f sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code
This patch just states the fact the cpusets/cpuhotplug interaction is
broken and removes the deadlockable code which only pretends to work.

- cpuset_lock() doesn't really work. It is needed for
  cpuset_cpus_allowed_locked() but we can't take this lock in
  try_to_wake_up()->select_fallback_rq() path.

- cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes
  callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex
  stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take
  cpuset_lock() and hangs forever because CPU is already dead and thus
  T can't be scheduled.

- cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
  which is not irq-safe, but try_to_wake_up() can be called from irq.

Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
we currently do without CONFIG_CPUSETS.

Also, with or without this patch, with or without CONFIG_CPUSETS, the
callers of select_fallback_rq() can race with each other or with
set_cpus_allowed() pathes.

The subsequent patches try to to fix these problems.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100315091003.GA9123@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:12:01 +02:00
Li Zefan
32bd7eb5a7 sched: Remove remaining USER_SCHED code
This is left over from commit 7c9414385e ("sched: Remove USER_SCHED"")

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Dhaval Giani <dhaval.giani@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Howells <dhowells@redhat.com>
LKML-Reference: <4BA9A05F.7010407@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:12:00 +02:00
Oleg Nesterov
47a70985e5 sched: set_cpus_allowed_ptr(): Don't use rq->migration_thread after unlock
Trivial typo fix. rq->migration_thread can be NULL after
task_rq_unlock(), this is why we have "mt" which should be
 used instead.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100330165829.GA18284@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:11:05 +02:00
Mike Galbraith
269484a492 sched: Fix proc_sched_set_task()
Latencytop clearing sum_exec_runtime via proc_sched_set_task() breaks
task_times().  Other places in kernel use nvcsw and nivcsw, which are
being cleared as well,  Clear task statistics only.

Reported-by: Török Edwin <edwintorok@gmail.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1269940193.19286.14.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:06:40 +02:00
Ingo Molnar
c9494727cf Merge branch 'linus' into sched/core
Merge reason: update to latest upstream

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 20:03:08 +02:00
Ingo Molnar
ec5e61aabe Merge branch 'perf/urgent' into perf/core
Conflicts:
	arch/x86/kernel/cpu/perf_event.c

Merge reason: Resolve the conflict, pick up fixes

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 19:38:10 +02:00
Mike Galbraith
8bb39f9aa0 perf: Fix 'perf sched record' deadlock
perf sched record can deadlock a box should the holder of
handle->data->lock take an interrupt, and then attempt to
acquire an rq lock held by a CPU trying to acquire the
same lock. Disable interrupts.

   CPU0                            CPU1
   sched event with rq->lock held
                                   grab handle->data->lock
   spin on handle->data->lock
                                   interrupt
                                   try to grab rq->lock

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1269598293.6174.8.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-04-02 19:30:05 +02:00
Ingo Molnar
50d11d190a Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/urgent 2010-04-02 19:29:17 +02:00
Frederic Weisbecker
e49a5bd381 perf: Use hot regs with software sched switch/migrate events
Scheduler's task migration events don't work because they always
pass NULL regs perf_sw_event(). The event hence gets filtered
in perf_swevent_add().

Scheduler's context switches events use task_pt_regs() to get
the context when the event occured which is a wrong thing to
do as this won't give us the place in the kernel where we went
to sleep but the place where we left userspace. The result is
even more wrong if we switch from a kernel thread.

Use the hot regs snapshot for both events as they belong to the
non-interrupt/exception based events family. Unlike page faults
or so that provide the regs matching the exact origin of the event,
we need to save the current context.

This makes the task migration event working and fix the context
switch callchains and origin ip.

Example: perf record -a -e cs

Before:

    10.91%      ksoftirqd/0                  0  [k] 0000000000000000
                |
                --- (nil)
                    perf_callchain
                    perf_prepare_sample
                    __perf_event_overflow
                    perf_swevent_overflow
                    perf_swevent_add
                    perf_swevent_ctx_event
                    do_perf_sw_event
                    __perf_sw_event
                    perf_event_task_sched_out
                    schedule
                    run_ksoftirqd
                    kthread
                    kernel_thread_helper

After:

    23.77%  hald-addon-stor  [kernel.kallsyms]  [k] schedule
            |
            --- schedule
               |
               |--60.00%-- schedule_timeout
               |          wait_for_common
               |          wait_for_completion
               |          blk_execute_rq
               |          scsi_execute
               |          scsi_execute_req
               |          sr_test_unit_ready
               |          |
               |          |--66.67%-- sr_media_change
               |          |          media_changed
               |          |          cdrom_media_changed
               |          |          sr_block_media_changed
               |          |          check_disk_change
               |          |          cdrom_open

v2: Always build perf_arch_fetch_caller_regs() now that software
events need that too. They don't need it from modules, unlike trace
events, so we keep the EXPORT_SYMBOL in trace_event_perf.c

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2010-04-01 08:26:31 +02:00
Frederic Weisbecker
eb1e79611c perf: Correctly align perf event tracing buffer
The trace event buffer used by perf to record raw sample events
is typed as an array of char and may then not be aligned to 8
by alloc_percpu().

But we need it to be aligned to 8 in sparc64 because we cast
this buffer into a random structure type built by the TRACE_EVENT()
macro to store the traces. So if a random 64 bits field is accessed
inside, it may be not under an expected good alignment.

Use an array of long instead to force the appropriate alignment, and
perform a compile time check to ensure the size in byte of the buffer
is a multiple of sizeof(long) so that its actual size doesn't get
shrinked under us.

This fixes unaligned accesses reported while using perf lock
in sparc 64.

Suggested-by: David Miller <davem@davemloft.net>
Suggested-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
2010-04-01 08:26:30 +02:00
Steven Rostedt
ff0ff84a07 ring-buffer: Add lost event count to end of sub buffer
Currently, binary readers of the ring buffer only know where events were
lost, but not how many events were lost at that location.
This information is available, but it would require adding another
field to the sub buffer header to include it.

But when a event can not fit at the end of a sub buffer, it is written
to the next sub buffer. This means there is a good chance that the
buffer may have room to hold this counter. If it does, write
the counter at the end of the sub buffer and set another flag
in the data size field that states that this information exists.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-31 22:57:08 -04:00
Steven Rostedt
bc21b47842 tracing: Show the lost events in the trace_pipe output
Now that the ring buffer can keep track of where events are lost.
Use this information to the output of trace_pipe:

       hackbench-3588  [001]  1326.701660: lock_acquire: ffffffff816591e0 read rcu_read_lock
       hackbench-3588  [001]  1326.701661: lock_acquire: ffff88003f4091f0 &(&dentry->d_lock)->rlock
       hackbench-3588  [001]  1326.701664: lock_release: ffff88003f4091f0 &(&dentry->d_lock)->rlock
CPU:1 [LOST 673 EVENTS]
       hackbench-3588  [001]  1326.702711: kmem_cache_free: call_site=ffffffff81102b85 ptr=ffff880026d96738
       hackbench-3588  [001]  1326.702712: lock_release: ffff88003e1480a8 &mm->mmap_sem
       hackbench-3588  [001]  1326.702713: lock_acquire: ffff88003e1480a8 &mm->mmap_sem

Even works with the function graph tracer:

 2) ! 170.098 us  |                                            }
 2)   4.036 us    |                                            rcu_irq_exit();
 2)   3.657 us    |                                            idle_cpu();
 2) ! 190.301 us  |                                          }
CPU:2 [LOST 2196 EVENTS]
 2)   0.853 us    |                            } /* cancel_dirty_page */
 2)               |                            remove_from_page_cache() {
 2)   1.578 us    |                              _raw_spin_lock_irq();
 2)               |                              __remove_from_page_cache() {

Note, it does not work with the iterator "trace" file, since it requires
the use of consuming the page from the ring buffer to determine how many
events were lost, which the iterator does not do.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-31 22:57:06 -04:00
Steven Rostedt
66a8cb95ed ring-buffer: Add place holder recording of dropped events
Currently, when the ring buffer drops events, it does not record
the fact that it did so. It does inform the writer that the event
was dropped by returning a NULL event, but it does not put in any
place holder where the event was dropped.

This is not a trivial thing to add because the ring buffer mostly
runs in overwrite (flight recorder) mode. That is, when the ring
buffer is full, new data will overwrite old data.

In a produce/consumer mode, where new data is simply dropped when
the ring buffer is full, it is trivial to add the placeholder
for dropped events. When there's more room to write new data, then
a special event can be added to notify the reader about the dropped
events.

But in overwrite mode, any new write can overwrite events. A place
holder can not be inserted into the ring buffer since there never
may be room. A reader could also come in at anytime and miss the
placeholder.

Luckily, the way the ring buffer works, the read side can find out
if events were lost or not, and how many events. Everytime a write
takes place, if it overwrites the header page (the next read) it
updates a "overrun" variable that keeps track of the number of
lost events. When a reader swaps out a page from the ring buffer,
it can record this number, perfom the swap, and then check to
see if the number changed, and take the diff if it has, which would be
the number of events dropped. This can be stored by the reader
and returned to callers of the reader.

Since the reader page swap will fail if the writer moved the head
page since the time the reader page set up the swap, this gives room
to record the overruns without worrying about races. If the reader
sets up the pages, records the overrun, than performs the swap,
if the swap succeeds, then the overrun variable has not been
updated since the setup before the swap.

For binary readers of the ring buffer, a flag is set in the header
of each sub page (sub buffer) of the ring buffer. This flag is embedded
in the size field of the data on the sub buffer, in the 31st bit (the size
can be 32 or 64 bits depending on the architecture), but only 27
bits needs to be used for the actual size (less actually).

We could add a new field in the sub buffer header to also record the
number of events dropped since the last read, but this will change the
format of the binary ring buffer a bit too much. Perhaps this change can
be made if the information on the number of events dropped is considered
important enough.

Note, the notification of dropped events is only used by consuming reads
or peeking at the ring buffer. Iterating over the ring buffer does not
keep this information because the necessary data is only available when
a page swap is made, and the iterator does not swap out pages.

Cc: Robert Richter <robert.richter@amd.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "Luis Claudio R. Goncalves" <lclaudio@uudg.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-31 22:57:04 -04:00
Steven Rostedt
eb0c53771f tracing: Fix compile error in module tracepoints when MODULE_UNLOAD not set
If modules are configured in the build but unloading of modules is not,
then the refcnt is not defined. Place the get/put module tracepoints
under CONFIG_MODULE_UNLOAD since it references this field in the module
structure.

As a side-effect, this patch also reduces the code when MODULE_UNLOAD
is not set, because these unused tracepoints are not created.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-31 22:56:59 -04:00
Li Zefan
ae832d1e03 tracing: Remove side effect from module tracepoints that caused a GPF
Remove the @refcnt argument, because it has side-effects, and arguments with
side-effects are not skipped by the jump over disabled instrumentation and are
executed even when the tracepoint is disabled.

This was also causing a GPF as found by Randy Dunlap:

Subject: 2.6.33 GP fault only when built with tracing
LKML-Reference: <4BA2B69D.3000309@oracle.com>

Note, the current 2.6.34-rc has a fix for the actual cause of the GPF,
but this fixes one of its triggers.

Tested-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BA97FA7.6040406@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-31 22:56:58 -04:00
Thomas Gleixner
753649dbc4 genirq: Force MSI irq handlers to run with interrupts disabled
Network folks reported that directing all MSI-X vectors of their multi
queue NICs to a single core can cause interrupt stack overflows when
enough interrupts fire at the same time.

This is caused by the fact that we run interrupt handlers by default
with interrupts enabled unless the driver reuqests the interrupt with
the IRQF_DISABLED set. The NIC handlers do not set this flag, so
simultaneous interrupts can nest unlimited and cause the stack
overflow.

The only safe counter measure is to run the interrupt handlers with
interrupts disabled. We can't switch to this mode in general right
now, but it is safe to do so for MSI interrupts.

Force IRQF_DISABLED for MSI interrupt handlers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Linus Torvalds <torvalds@osdl.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: stable@kernel.org
2010-03-31 15:48:38 +02:00
Linus Torvalds
246750ffa1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
  CRED: Fix memory leak in error handling
2010-03-30 07:26:30 -07:00
Linus Torvalds
be3fd3cc7c Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Do not free zero sized per cpu areas
  x86: Make sure free_init_pages() frees pages on page boundary
  x86: Make smp_locks end with page alignment
2010-03-30 07:22:38 -07:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Mathieu Desnoyers
570b8fb505 CRED: Fix memory leak in error handling
Fix a memory leak on an OOM condition in prepare_usermodehelper_creds().

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-03-30 17:15:38 +11:00
Julia Lawall
292f60c0c4 ring-buffer: Add missing unlock
In some error handling cases the lock is not unlocked.  The return is
converted to a goto, to share the unlock at the end of the function.

A simplified version of the semantic patch that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r exists@
expression E1;
identifier f;
@@

f (...) { <+...
* spin_lock_irq (E1,...);
... when != E1
* return ...;
...+> }
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
LKML-Reference: <Pine.LNX.4.64.1003291736440.21896@ask.diku.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-29 15:23:24 -04:00
Li Zefan
e36673ec51 tracing: Fix lockdep warning in global_clock()
# echo 1 > events/enable
 # echo global > trace_clock

------------[ cut here ]------------
WARNING: at kernel/lockdep.c:3162 check_flags+0xb2/0x190()
...
---[ end trace 3f86734a89416623 ]---
possible reason: unannotated irqs-on.
...

There's no reason to use the raw_local_irq_save() in trace_clock_global.
The local_irq_save() version is fine, and does not cause the bug in lockdep.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BA97FA1.7030606@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-29 15:16:44 -04:00
Ian Campbell
eed63519e3 x86: Do not free zero sized per cpu areas
This avoids an infinite loop in free_early_partial().

Add a warning to free_early_partial() to catch future problems.

-v5: put back start > end back into WARN_ONCE()
-v6: use one line for warning, suggested by Linus
-v7: more tests
-v8: remove the function name as suggested by Johannes
     WARN_ONCE() will print out that function name.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Joel Becker <joel.becker@oracle.com>
Tested-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <1269830604-26214-4-git-send-email-yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-29 18:55:40 +02:00
David Howells
a53f4f9efa SLOW_WORK: CONFIG_SLOW_WORK_PROC should be CONFIG_SLOW_WORK_DEBUG
CONFIG_SLOW_WORK_PROC was changed to CONFIG_SLOW_WORK_DEBUG, but not in all
instances.  Change the remaining instances.  This makes the debugfs file
display the time mark and the owner's description again.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-29 09:14:47 -07:00
Dave Airlie
88be12c440 slow-work: use get_ref wrapper instead of directly calling get_ref
Otherwise we can get an oops if the user has no get_ref/put_ref
requirement.

Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-29 09:13:30 -07:00
Tejun Heo
10fad5e46f percpu, module: implement and use is_kernel/module_percpu_address()
lockdep has custom code to check whether a pointer belongs to static
percpu area which is somewhat broken.  Implement proper
is_kernel/module_percpu_address() and replace the custom code.

On UP, percpu variables are regular static variables and can't be
distinguished from them.  Always return %false on UP.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@redhat.com>
2010-03-29 23:07:12 +09:00
Tejun Heo
259354deaa module: encapsulate percpu handling better and record percpu_size
Better encapsulate module static percpu area handling so that code
outsidef of CONFIG_SMP ifdef doesn't deal with mod->percpu directly
and add mod->percpu_size and record percpu_size in it.  Both percpu
fields are compiled out on UP.  While at it, mark mod->percpu w/
__percpu.

This is to prepare for is_module_percpu_address().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
2010-03-29 23:07:12 +09:00
Henrik Kretzschmar
975d260355 padata: Section cleanup
This patch removes the __cupinit from padata_cpu_callback(),
which is refered by the exportet function padata_alloc().

This could lead to problems if CONFIG_HOTPLUG_CPU is disabled,
which should happen very often.

WARNING: kernel/built-in.o(.text+0x7ffcb): Section mismatch in reference from the function padata_alloc() to the function .cpuinit.text:padata_cpu_callback()
The function padata_alloc() references
the function __cpuinit padata_cpu_callback().
This is often because padata_alloc lacks a __cpuinit
annotation or the annotation of padata_cpu_callback is wrong.

Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-03-29 16:15:31 +08:00
Linus Torvalds
b72c40949b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
  x86/PCI: truncate _CRS windows with _LEN > _MAX - _MIN + 1
  x86/PCI: for host bridge address space collisions, show conflicting resource
  frv/PCI: remove redundant warnings
  x86/PCI: remove redundant warnings
  PCI: don't say we claimed a resource if we failed
  PCI quirk: Disable MSI on VIA K8T890 systems
  PCI quirk: RS780/RS880: work around missing MSI initialization
  PCI quirk: only apply CX700 PCI bus parking quirk if external VT6212L is present
  PCI: complain about devices that seem to be broken
  PCI: print resources consistently with %pR
  PCI: make disabled window printk style match the enabled ones
  PCI: break out primary/secondary/subordinate for readability
  PCI: for address space collisions, show conflicting resource
  resources: add interfaces that return conflict information
  PCI: cleanup error return for pcix get and set mmrbc functions
  PCI: fix access of PCI_X_CMD by pcix get and set mmrbc functions
  PCI: kill off pci_register_set_vga_state() symbol export.
  PCI: fix return value from pcix_get_max_mmrbc()
2010-03-26 16:34:29 -07:00
Matt Helsley
5a7aadfe2f Freezer: Fix buggy resume test for tasks frozen with cgroup freezer
When the cgroup freezer is used to freeze tasks we do not want to thaw
those tasks during resume. Currently we test the cgroup freezer
state of the resuming tasks to see if the cgroup is FROZEN.  If so
then we don't thaw the task. However, the FREEZING state also indicates
that the task should remain frozen.

This also avoids a problem pointed out by Oren Ladaan: the freezer state
transition from FREEZING to FROZEN is updated lazily when userspace reads
or writes the freezer.state file in the cgroup filesystem. This means that
resume will thaw tasks in cgroups which should be in the FROZEN state if
there is no read/write of the freezer.state file to trigger this
transition before suspend.

NOTE: Another "simple" solution would be to always update the cgroup
freezer state during resume. However it's a bad choice for several reasons:
Updating the cgroup freezer state is somewhat expensive because it requires
walking all the tasks in the cgroup and checking if they are each frozen.
Worse, this could easily make resume run in N^2 time where N is the number
of tasks in the cgroup. Finally, updating the freezer state from this code
path requires trickier locking because of the way locks must be ordered.

Instead of updating the freezer state we rely on the fact that lazy
updates only manage the transition from FREEZING to FROZEN. We know that
a cgroup with the FREEZING state may actually be FROZEN so test for that
state too. This makes sense in the resume path even for partially-frozen
cgroups -- those that really are FREEZING but not FROZEN.

Reported-by: Oren Ladaan <orenl@cs.columbia.edu>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: stable@kernel.org
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-03-26 23:51:44 +01:00
Xiaotian Feng
4f598458ea Freezer: Only show the state of tasks refusing to freeze
show_state will dump all tasks state, so if freezer failed to freeze
any task, kernel will dump all tasks state and flood the dmesg log.
This patch makes freezer only show state of tasks refusing to freeze.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-03-26 23:51:13 +01:00
Linus Torvalds
054319b5e2 Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  time: Fix accumulation bug triggered by long delay.
  posix-cpu-timers: Reset expire cache when no timer is running
  timer stats: Fix del_timer_sync() and try_to_del_timer_sync()
  clockevents: Sanitize min_delta_ns adjustment and prevent overflows
2010-03-26 15:10:38 -07:00
Linus Torvalds
833961d81f Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align
2010-03-26 15:10:13 -07:00
Linus Torvalds
3cacf42462 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Use proper type in sched_getaffinity()
  kernel/sched.c: Suppress unused var warning
  sched: sched_getaffinity(): Allow less than NR_CPUS length
2010-03-26 15:09:59 -07:00
Linus Torvalds
309d1dcb5b Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: Move two IRQ functions from .init.text to .text
  genirq: Protect access to irq_desc->action in can_request_irq()
  genirq: Prevent oneshot irq thread race
2010-03-26 15:09:06 -07:00
Linus Torvalds
8128f55a0b Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Remove excessive early_res debug output
  softlockup: Stop spurious softlockup messages due to overflow
  rcu: Fix local_irq_disable() CONFIG_PROVE_RCU=y false positives
  rcu: Fix tracepoints & lockdep false positive
  rcu: Make rcu_read_lock_bh_held() allow for disabled BH
2010-03-26 15:08:31 -07:00
Peter Zijlstra
faa4602e47 x86, perf, bts, mm: Delete the never used BTS-ptrace code
Support for the PMU's BTS features has been upstreamed in
v2.6.32, but we still have the old and disabled ptrace-BTS,
as Linus noticed it not so long ago.

It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without
regard for other uses (perf) and doesn't provide the flexibility
needed for perf either.

Its users are ptrace-block-step and ptrace-bts, since ptrace-bts
was never used and ptrace-block-step can be implemented using a
much simpler approach.

So axe all 3000 lines of it. That includes the *locked_memory*()
APIs in mm/mlock.c as well.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Markus Metzger <markus.t.metzger@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20100325135413.938004390@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-26 11:33:55 +01:00
Miao Xie
53feb29767 cpuset: alloc nodemask_t on the heap rather than the stack
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:31:21 -07:00
Miao Xie
5ab116c934 cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node
cpuset_mem_spread_node() returns an offline node, and causes an oops.

This patch fixes it by initializing task->mems_allowed to
node_states[N_HIGH_MEMORY], and updating task->mems_allowed when doing
memory hotplug.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Reported-by: Nick Piggin <npiggin@suse.de>
Tested-by: Nick Piggin <npiggin@suse.de>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:31:21 -07:00
Li Zefan
9d34706f42 cgroups: remove duplicate include
commit e6a1105b ("cgroups: subsystem module loading interface") and commit
c50cc752 ("sched, cgroups: Fix module export") result in duplicate
including of module.h

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:31:19 -07:00
Henrik Kretzschmar
860652bfb8 genirq: Move two IRQ functions from .init.text to .text
Both functions should not be marked as __init, since they be called
from modules after the init section is freed.

Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jkosina@suse.cz>
LKML-Reference: <1269431961-5731-1-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-24 14:38:23 +01:00
Thomas Gleixner
cc8c3b7843 genirq: Protect access to irq_desc->action in can_request_irq()
can_request_irq() accesses and dereferences irq_desc->action w/o
holding irq_desc->lock. So action can be freed on another CPU before
it's dereferenced. Unlikely, but ...

Protect it with desc->lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-24 14:38:23 +01:00
Dimitri Sivanich
92d6b71ab9 genirq: Expose irq_desc->node in proc/irq
Expose irq_desc->node as /proc/irq/*/node.

This file provides device hardware locality information for apps
desiring to include hardware locality in irq mapping decisions.

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-24 14:10:03 +01:00
Bjorn Helgaas
66f1207bce resources: add interfaces that return conflict information
request_resource() and insert_resource() only return success or failure,
which no information about what existing resource conflicted with the
proposed new reservation.  This patch adds request_resource_conflict()
and insert_resource_conflict(), which return the conflicting resource.

Callers may use this for better error messages or to adjust the new
resource and retry the request.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2010-03-23 13:33:50 -07:00
John Stultz
e1292ba164 ntp: Make time_adjust static
Now that no arches are accessing time_adjust directly,
make it static.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1268968769-19209-1-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-23 17:19:37 +01:00
John Stultz
830ec0458c time: Fix accumulation bug triggered by long delay.
The logarithmic accumulation done in the timekeeping has some overflow
protection that limits the max shift value. That means it will take
more then shift loops to accumulate all of the cycles. This causes
the shift decrement to underflow, which causes the loop to never exit.

The simplest fix would be simply to do a:
	if (shift)
		shift--;

However that is not optimal, as we know the cycle offset is larger
then the interval << shift, the above would make shift drop to zero,
then we would be spinning for quite awhile accumulating at interval
chunks at a time.

Instead, this patch only decreases shift if the offset is smaller
then cycle_interval << shift.  This makes sure we accumulate using
the largest chunks possible without overflowing tick_length, and limits
the number of iterations through the loop.

This issue was found and reported by Sonic Zhang, who also tested the fix.
Many thanks your explanation and testing!

Reported-by: Sonic Zhang <sonic.adi@gmail.com>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Tested-by: Sonic Zhang <sonic.adi@gmail.com>
LKML-Reference: <1268948850-5225-1-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-23 16:41:01 +01:00
Ingo Molnar
d2f1e15b66 Merge commit 'v2.6.34-rc2' into perf/core
Merge reason: Pick up latest perf fixes from upstream.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-22 18:47:01 +01:00
Colin Ian King
8c2eb4805d softlockup: Stop spurious softlockup messages due to overflow
Ensure additions on touch_ts do not overflow.  This can occur
when the top 32 bits of the TSC reach 0xffffffff causing
additions to touch_ts to overflow and this in turn generates
spurious softlockup warnings.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@kernel.org>
LKML-Reference: <1268994482.1798.6.camel@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-21 19:30:13 +01:00
Steven Rostedt
2271048d1b ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align
The ring buffer uses 4 byte alignment while recording events into the
buffer, even on 64bit machines. This saves space when there are lots
of events being recorded at 4 byte boundaries.

The ring buffer has a zero copy method to write into the buffer, with
the reserving of space and then committing it. This may cause problems
when writing an 8 byte word into a 4 byte alignment (not 8). For x86 and
PPC this is not an issue, but on some architectures this would cause an
out-of-alignment exception.

This patch uses CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine
if it is OK to use 4 byte alignments on 64 bit machines. If it is not,
it forces the ring buffer event header to be 8 bytes and not 4,
and will align the length of the data to be 8 byte aligned.
This keeps the data payload at 8 byte alignments and will allow these
machines to run without issue.

The trick to this is that the header can be either 4 bytes or 8 bytes
depending on the length of the data payload. The 4 byte header
has a length field that supports up to 112 bytes. If the length of
the data is more than 112, the length field is set to zero, and the actual
length is stored in the next 4 bytes after the header.

When CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set, the code forces
zero in the 4 byte header forcing the length to be stored in the 4 byte
array, even with a small data load. It also forces the length of the
data load to be 8 byte aligned. The combination of these two guarantee
that the data is always at 8 byte alignment.

Tested-by: Frederic Weisbecker <fweisbec@gmail.com>
           (on sparc64)
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-18 23:11:35 -04:00
Linus Torvalds
f82c37e7bb Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
  perf: Fix unexported generic perf_arch_fetch_caller_regs
  perf record: Don't try to find buildids in a zero sized file
  perf: export perf_trace_regs and perf_arch_fetch_caller_regs
  perf, x86: Fix hw_perf_enable() event assignment
  perf, ppc: Fix compile error due to new cpu notifiers
  perf: Make the install relative to DESTDIR if specified
  kprobes: Calculate the index correctly when freeing the out-of-line execution slot
  perf tools: Fix sparse CPU numbering related bugs
  perf_event: Fix oops triggered by cpu offline/online
  perf: Drop the obsolete profile naming for trace events
  perf: Take a hot regs snapshot for trace events
  perf: Introduce new perf_fetch_caller_regs() for hot regs snapshot
  perf/x86-64: Use frame pointer to walk on irq and process stacks
  lockdep: Move lock events under lockdep recursion protection
  perf report: Print the map table just after samples for which no map was found
  perf report: Add multiple event support
  perf session: Change perf_session post processing functions to take histogram tree
  perf session: Add storage for seperating event types in report
  perf session: Change add_hist_entry to take the tree root instead of session
  perf record: Add ID and to recorded event data when recording multiple events
  ...
2010-03-18 16:52:46 -07:00
Frederic Weisbecker
dcd5c1662d perf: Fix unexported generic perf_arch_fetch_caller_regs
perf_arch_fetch_caller_regs() is exported for the overriden x86
version, but not for the generic weak version.

As a general rule, weak functions should not have their symbol
exported in the same file they are defined.

So let's export it on trace_event_perf.c as it is used by trace
events only.

This fixes:

	ERROR: ".perf_arch_fetch_caller_regs" [fs/xfs/xfs.ko] undefined!
	ERROR: ".perf_arch_fetch_caller_regs" [arch/powerpc/platforms/cell/spufs/spufs.ko] undefined!

-v2: And also only build it if trace events are enabled.
-v3: Fix changelog mistake

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1268697902-9518-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-17 12:26:49 +01:00
KOSAKI Motohiro
8bc037fb89 sched: Use proper type in sched_getaffinity()
Using the proper type fixes the following compiler warning:

  kernel/sched.c:4850: warning: comparison of distinct pointer types lacks a cast

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: torvalds@linux-foundation.org
Cc: travis@sgi.com
Cc: peterz@infradead.org
Cc: drepper@redhat.com
Cc: rja@sgi.com
Cc: sharyath@in.ibm.com
Cc: steiner@sgi.com
LKML-Reference: <20100317090046.4C79.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-17 10:48:49 +01:00
Thomas Weber
8839316121 Fix typos in comments
[Ss]ytem => [Ss]ystem
udpate => update
paramters => parameters
orginal => original

Signed-off-by: Thomas Weber <swirl@gmx.li>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-03-16 11:47:56 +01:00
Andrew Morton
c890692bf3 kernel/sched.c: Suppress unused var warning
On UP:

  kernel/sched.c: In function 'wake_up_new_task':
  kernel/sched.c:2631: warning: unused variable 'cpu'

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-16 11:13:42 +01:00
Dan Carpenter
6427462bfa sched: Remove some dead code
This was left over from "7c9414385e sched: Remove USER_SCHED"

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Dhaval Giani <dhaval.giani@gmail.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
LKML-Reference: <20100315082148.GD18181@bicker>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-16 11:05:44 +01:00
Paul E. McKenney
e3818b8dce rcu: Make rcu_read_lock_bh_held() allow for disabled BH
Disabling BH can stand in for rcu_read_lock_bh(), and this patch
updates rcu_read_lock_bh_held() to allow for this.  In order to
avoid include-file hell, this function is moved out of line to
kernel/rcupdate.c.

This fixes a false positive RCU warning.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <20100316000343.GA25857@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-16 09:57:49 +01:00
Frederic Weisbecker
1d199b1ad6 perf: Fix unexported generic perf_arch_fetch_caller_regs
perf_arch_fetch_caller_regs() is exported for the overriden x86
version, but not for the generic weak version.

As a general rule, weak functions should not have their symbol
exported in the same file they are defined.

So let's export it on trace_event_perf.c as it is used by trace
events only.

This fixes:

	ERROR: ".perf_arch_fetch_caller_regs" [fs/xfs/xfs.ko] undefined!
	ERROR: ".perf_arch_fetch_caller_regs" [arch/powerpc/platforms/cell/spufs/spufs.ko] undefined!

-v2: And also only build it if trace events are enabled.
-v3: Fix changelog mistake

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1268697902-9518-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-16 09:27:27 +01:00
KOSAKI Motohiro
cd3d8031eb sched: sched_getaffinity(): Allow less than NR_CPUS length
[ Note, this commit changes the syscall ABI for > 1024 CPUs systems. ]

Recently, some distro decided to use NR_CPUS=4096 for mysterious reasons.
Unfortunately, glibc sched interface has the following definition:

	# define __CPU_SETSIZE  1024
	# define __NCPUBITS     (8 * sizeof (__cpu_mask))
	typedef unsigned long int __cpu_mask;
	typedef struct
	{
	  __cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS];
	} cpu_set_t;

It mean, if NR_CPUS is bigger than 1024, cpu_set_t makes an
ABI issue ...

More recently, Sharyathi Nagesh reported following test program makes
misterious syscall failure:

 -----------------------------------------------------------------------
 #define _GNU_SOURCE
 #include<stdio.h>
 #include<errno.h>
 #include<sched.h>

 int main()
 {
     cpu_set_t set;
     if (sched_getaffinity(0, sizeof(cpu_set_t), &set) < 0)
         printf("\n Call is failing with:%d", errno);
 }
 -----------------------------------------------------------------------

Because the kernel assumes len argument of sched_getaffinity() is bigger
than NR_CPUS. But now it is not correct.

Now we are faced with the following annoying dilemma, due to
the limitations of the glibc interface built in years ago:

 (1) if we change glibc's __CPU_SETSIZE definition, we lost
     binary compatibility of _all_ application.

 (2) if we don't change it, we also lost binary compatibility of
     Sharyathi's use case.

Then, I would propse to change the rule of the len argument of
sched_getaffinity().

Old:
	len should be bigger than NR_CPUS
New:
	len should be bigger than maximum possible cpu id

This creates the following behavior:

 (A) In the real 4096 cpus machine, the above test program still
     return -EINVAL.

 (B) NR_CPUS=4096 but the machine have less than 1024 cpus (almost
     all machines in the world), the above can run successfully.

Fortunatelly, BIG SGI machine is mainly used for HPC use case. It means
they can rebuild their programs.

IOW we hope they are not annoyed by this issue ...

Reported-by: Sharyathi Nagesh <sharyath@in.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Mike Travis <travis@sgi.com>
LKML-Reference: <20100312161316.9520.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-15 08:28:44 +01:00
Ingo Molnar
12b8aeee3e Merge branch 'linus' into timers/core
Conflicts:
	Documentation/feature-removal-schedule.txt

Merge reason: Resolve the conflict, update to upstream.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-15 08:17:33 +01:00
Linus Torvalds
80a186074e Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Fix pick_next_highest_task_rt() for cgroups
  sched: Cleanup: remove unused variable in try_to_wake_up()
  x86: Fix sched_clock_cpu for systems with unsynchronized TSC
2010-03-13 14:46:18 -08:00
Linus Torvalds
4e3eaddd14 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  locking: Make sparse work with inline spinlocks and rwlocks
  x86/mce: Fix RCU lockdep splats
  rcu: Increase RCU CPU stall timeouts if PROVE_RCU
  ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
  rcu: Suppress RCU lockdep warnings during early boot
  rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
  rcu: Suppress __mpol_dup() false positive from RCU lockdep
  rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
  rcu: Add control variables to lockdep_rcu_dereference() diagnostics
  rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
  rcu: Use wrapper function instead of exporting tasklist_lock
  sched, rcu: Fix rcu_dereference() for RCU-lockdep
  rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
  rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
  x86/gart: Unexport gart_iommu_aperture

Fix trivial conflicts in kernel/trace/ftrace.c
2010-03-13 14:43:01 -08:00
Linus Torvalds
8655e7e3dd Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing: Do not record user stack trace from NMI context
  tracing: Disable buffer switching when starting or stopping trace
  tracing: Use same local variable when resetting the ring buffer
  function-graph: Init curr_ret_stack with ret_stack
  ring-buffer: Move disabled check into preempt disable section
  function-graph: Add tracing_thresh support to function_graph tracer
  tracing: Update the comm field in the right variable in update_max_tr
  function-graph: Use comment notation for func names of dangling '}'
  function-graph: Fix unused reference to ftrace_set_func()
  tracing: Fix warning in s_next of trace file ops
  tracing: Include irqflags headers from trace clock
2010-03-13 14:40:50 -08:00
Linus Torvalds
9fdfbc2bff Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf: Provide generic perf_sample_data initialization
  MAINTAINERS: Add Arnaldo as tools/perf/ co-maintainer
  perf trace: Don't use pager if scripting
  perf trace/scripting: Remove extraneous header read
  perf, ARM: Modify kuser rmb() call to compile for Thumb-2
  x86/stacktrace: Don't dereference bad frame pointers
  perf archive: Don't try to collect files without a build-id
  perf_events, x86: Fixup fixed counter constraints
  perf, x86: Restrict the ANY flag
  perf, x86: rename macro in ARCH_PERFMON_EVENTSEL_ENABLE
  perf, x86: add some IBS macros to perf_event.h
  perf, x86: make IBS macros available in perf_event.h
  hw-breakpoints: Remove stub unthrottle callback
  x86/hw-breakpoints: Remove the name field
  perf: Remove pointless breakpoint union
  perf lock: Drop the buffers multiplexing dependency
  perf lock: Fix and add misc documentally things
  percpu: Add __percpu sparse annotations to hw_breakpoint
2010-03-13 14:39:42 -08:00
Steven Rostedt
b6345879cc tracing: Do not record user stack trace from NMI context
A bug was found with Li Zefan's ftrace_stress_test that caused applications
to segfault during the test.

Placing a tracing_off() in the segfault code, and examining several
traces, I found that the following was always the case. The lock tracer
was enabled (lockdep being required) and userstack was enabled. Testing
this out, I just enabled the two, but that was not good enough. I needed
to run something else that could trigger it. Running a load like hackbench
did not work, but executing a new program would. The following would
trigger the segfault within seconds:

  # echo 1 > /debug/tracing/options/userstacktrace
  # echo 1 > /debug/tracing/events/lock/enable
  # while :; do ls > /dev/null ; done

Enabling the function graph tracer and looking at what was happening
I finally noticed that all cashes happened just after an NMI.

 1)               |    copy_user_handle_tail() {
 1)               |      bad_area_nosemaphore() {
 1)               |        __bad_area_nosemaphore() {
 1)               |          no_context() {
 1)               |            fixup_exception() {
 1)   0.319 us    |              search_exception_tables();
 1)   0.873 us    |            }
[...]
 1)   0.314 us    |  __rcu_read_unlock();
 1)   0.325 us    |    native_apic_mem_write();
 1)   0.943 us    |  }
 1)   0.304 us    |  rcu_nmi_exit();
[...]
 1)   0.479 us    |  find_vma();
 1)               |  bad_area() {
 1)               |    __bad_area() {

After capturing several traces of failures, all of them happened
after an NMI. Curious about this, I added a trace_printk() to the NMI
handler to read the regs->ip to see where the NMI happened. In which I
found out it was here:

ffffffff8135b660 <page_fault>:
ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>

What was happening is that the NMI would happen at the place that a page
fault occurred. It would call rcu_read_lock() which was traced by
the lock events, and the user_stack_trace would run. This would trigger
a page fault inside the NMI. I do not see where the CR2 register is
saved or restored in NMI handling. This means that it would corrupt
the page fault handling that the NMI interrupted.

The reason the while loop of ls helped trigger the bug, was that
each execution of ls would cause lots of pages to be faulted in, and
increase the chances of the race happening.

The simple solution is to not allow user stack traces in NMI context.
After this patch, I ran the above "ls" test for a couple of hours
without any issues. Without this patch, the bug would trigger in less
than a minute.

Cc: stable@kernel.org
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-12 20:31:49 -05:00
Steven Rostedt
a2f8071428 tracing: Disable buffer switching when starting or stopping trace
When the trace iterator is read, tracing_start() and tracing_stop()
is called to stop tracing while the iterator is processing the trace
output.

These functions disable both the standard buffer and the max latency
buffer. But if the wakeup tracer is running, it can switch these
buffers between the two disables:

  buffer = global_trace.buffer;
  if (buffer)
      ring_buffer_record_disable(buffer);

      <<<--------- swap happens here

  buffer = max_tr.buffer;
  if (buffer)
      ring_buffer_record_disable(buffer);

What happens is that we disabled the same buffer twice. On tracing_start()
we can enable the same buffer twice. All ring_buffer_record_disable()
must be matched with a ring_buffer_record_enable() or the buffer
can be disable permanently, or enable prematurely, and cause a bug
where a reset happens while a trace is commiting.

This patch protects these two by taking the ftrace_max_lock to prevent
a switch from occurring.

Found with Li Zefan's ftrace_stress_test.

Cc: stable@kernel.org
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-12 20:30:21 -05:00
Steven Rostedt
283740c619 tracing: Use same local variable when resetting the ring buffer
In the ftrace code that resets the ring buffer it references the
buffer with a local variable, but then uses the tr->buffer as the
parameter to reset. If the wakeup tracer is running, which can
switch the tr->buffer with the max saved buffer, this can break
the requirement of disabling the buffer before the reset.

   buffer = tr->buffer;
   ring_buffer_record_disable(buffer);
   synchronize_sched();
   __tracing_reset(tr->buffer, cpu);

If the tr->buffer is swapped, then the reset is not happening to the
buffer that was disabled. This will cause the ring buffer to fail.

Found with Li Zefan's ftrace_stress_test.

Cc: stable@kernel.org
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-12 20:29:20 -05:00
Steven Rostedt
ea14eb7140 function-graph: Init curr_ret_stack with ret_stack
If the graph tracer is active, and a task is forked but the allocating of
the processes graph stack fails, it can cause crash later on.

This is due to the temporary stack being NULL, but the curr_ret_stack
variable is copied from the parent. If it is not -1, then in
ftrace_graph_probe_sched_switch() the following:

	for (index = next->curr_ret_stack; index >= 0; index--)
		next->ret_stack[index].calltime += timestamp;

Will cause a kernel OOPS.

Found with Li Zefan's ftrace_stress_test.

Cc: stable@kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-12 20:28:02 -05:00
Lai Jiangshan
52fbe9cde7 ring-buffer: Move disabled check into preempt disable section
The ring buffer resizing and resetting relies on a schedule RCU
action. The buffers are disabled, a synchronize_sched() is called
and then the resize or reset takes place.

But this only works if the disabling of the buffers are within the
preempt disabled section, otherwise a window exists that the buffers
can be written to while a reset or resize takes place.

Cc: stable@kernel.org
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B949E43.2010906@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-12 20:26:56 -05:00
Linus Torvalds
64d5aea300 Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  timekeeping: Prevent oops when GENERIC_TIME=n
2010-03-12 16:27:08 -08:00
Linus Torvalds
c32da02342 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits)
  doc: fix typo in comment explaining rb_tree usage
  Remove fs/ntfs/ChangeLog
  doc: fix console doc typo
  doc: cpuset: Update the cpuset flag file
  Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed
  Remove drivers/parport/ChangeLog
  Remove drivers/char/ChangeLog
  doc: typo - Table 1-2 should refer to "status", not "statm"
  tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments
  No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h
  devres/irq: Fix devm_irq_match comment
  Remove reference to kthread_create_on_cpu
  tree-wide: Assorted spelling fixes
  tree-wide: fix 'lenght' typo in comments and code
  drm/kms: fix spelling in error message
  doc: capitalization and other minor fixes in pnp doc
  devres: typo fix s/dev/devm/
  Remove redundant trailing semicolons from macros
  fix typo "definetly" -> "definitely" in comment
  tree-wide: s/widht/width/g typo in comments
  ...

Fix trivial conflict in Documentation/laptops/00-INDEX
2010-03-12 16:04:50 -08:00
Dave Young
2edf5e4980 sysctl extern cleanup: lockdep
Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move lockdep extern declarations to linux/lockdep.h

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:53:10 -08:00
Dave Young
4f0e056fde sysctl extern cleanup: rtmutex
Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move max_lock_depth extern declaration to linux/rtmutex.h

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:53:10 -08:00
Dave Young
c55b7c3e82 sysctl extern cleanup: acct
Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move acct_parm extern declaration to linux/acct.h

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:53:10 -08:00
Dave Young
15485a4682 sysctl extern cleanup: sg
Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move sg_big_buff extern declaration to scsi/sg.h

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Acked-by: Doug Gilbert <dgilbert@interlog.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:53:10 -08:00
Dave Young
5ed109103d sysctl extern cleanup: module
Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move modprobe_path extern declaration to linux/kmod.h
Move modules_disabled extern declaration to linux/module.h

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:53:10 -08:00
Dave Young
e5ab67726f sysctl extern cleanup: rcu
Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move rcutorture_runnable extern declaration to linux/rcupdate.h

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Acked-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:53:08 -08:00
Dave Young
d33ed52d57 sysctl extern cleanup: signal
Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move print_fatal_signals extern declaration to linux/signal.h

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:44 -08:00
Dave Young
eb5572fed5 sysctl extern cleanup: C_A_D
Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move C_A_D extern variable declaration to linux/reboot.h

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:44 -08:00
Alexey Dobriyan
8467005da3 nsproxy: remove INIT_NSPROXY()
Remove INIT_NSPROXY(), use C99 initializer.
Remove INIT_IPC_NS(), INIT_NET_NS() while I'm at it.

Note: headers trim will be done later, now it's quite pointless because
results will be invalidated by merge window.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:40 -08:00
Oleg Nesterov
13aa9a6b0f pid_ns: zap_pid_ns_processes: use SEND_SIG_NOINFO instead of force_sig()
zap_pid_ns_processes() uses force_sig(SIGKILL) to ensure SIGKILL will be
delivered to sub-namespace inits as well.  This is correct, but we are
going to change force_sig_info() semantics.  See
http://bugzilla.kernel.org/show_bug.cgi?id=15395#c31

We can use send_sig_info(SEND_SIG_NOINFO) instead, since
614c517d7c ("signals: SEND_SIG_NOINFO should
be considered as SI_FROMUSER()") SEND_SIG_NOINFO means "from user" and
therefore send_signal() will get the correct from_ancestor_ns = T flag.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:40 -08:00
Veaceslav Falico
93c59907c6 copy_signal() cleanup: clean thread_group_cputime_init()
Remove unneeded initializations in thread_group_cputime_init() and in
posix_cpu_timers_init_group().  They are useless after kmem_cache_zalloc()
was used in copy_signal().

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:39 -08:00
Veaceslav Falico
4dd66e69d4 copy_signal() cleanup: kill taskstats_tgid_init() and acct_init_pacct()
Kill unused functions taskstats_tgid_init() and acct_init_pacct() because
we don't use them anywhere after using kmem_cache_zalloc() in
copy_signal().

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:39 -08:00
Veaceslav Falico
a56704ef6b copy_signal() cleanup: use zalloc and remove initializations
Use kmem_cache_zalloc() on signal creation and remove unneeded
initialization lines in copy_signal().

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:39 -08:00
Kirill A. Shutemov
a0a4db548e cgroups: remove events before destroying subsystem state objects
Events should be removed after rmdir of cgroup directory, but before
destroying subsystem state objects.  Let's take reference to cgroup
directory dentry to do that.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dan Malek <dan@embeddedalley.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:37 -08:00
Kirill A. Shutemov
4ab78683c1 cgroups: fix race between userspace and kernelspace
Notify userspace about cgroup removing only after rmdir of cgroup
directory to avoid race between userspace and kernelspace.

eventfd are used to notify about two types of event:
 - control file-specific, like crossing memory threshold;
 - cgroup removing.

To understand what really happen, userspace can check if the cgroup still
exists.  To avoid race beetween userspace and kernelspace we have to
notify userspace about cgroup removing only after rmdir of cgroup
directory.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dan Malek <dan@embeddedalley.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:37 -08:00
Kirill A. Shutemov
0dea116876 cgroup: implement eventfd-based generic API for notifications
This patchset introduces eventfd-based API for notifications in cgroups
and implements memory notifications on top of it.

It uses statistics in memory controler to track memory usage.

Output of time(1) on building kernel on tmpfs:

Root cgroup before changes:
	make -j2  506.37 user 60.93s system 193% cpu 4:52.77 total
Non-root cgroup before changes:
	make -j2  507.14 user 62.66s system 193% cpu 4:54.74 total
Root cgroup after changes (0 thresholds):
	make -j2  507.13 user 62.20s system 193% cpu 4:53.55 total
Non-root cgroup after changes (0 thresholds):
	make -j2  507.70 user 64.20s system 193% cpu 4:55.70 total
Root cgroup after changes (1 thresholds, never crossed):
	make -j2  506.97 user 62.20s system 193% cpu 4:53.90 total
Non-root cgroup after changes (1 thresholds, never crossed):
	make -j2  507.55 user 64.08s system 193% cpu 4:55.63 total

This patch:

Introduce the write-only file "cgroup.event_control" in every cgroup.

To register new notification handler you need:
- create an eventfd;
- open a control file to be monitored. Callbacks register_event() and
  unregister_event() must be defined for the control file;
- write "<event_fd> <control_fd> <args>" to cgroup.event_control.
  Interpretation of args is defined by control file implementation;

eventfd will be woken up by control file implementation or when the
cgroup is removed.

To unregister notification handler just close eventfd.

If you need notification functionality for a control file you have to
implement callbacks register_event() and unregister_event() in the
struct cftype.

[kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix]
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dan Malek <dan@embeddedalley.com>
Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Alexander Shishkin <virtuoso@slind.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:37 -08:00
Li Zefan
b70cc5fdb4 cgroups: clean up cgroup_pidlist_find() a bit
Don't call get_pid_ns() before we locate/alloc the ns.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:36 -08:00
Ben Blum
67523c48aa cgroups: blkio subsystem as module
Modify the Block I/O cgroup subsystem to be able to be built as a module.
As the CFQ disk scheduler optionally depends on blk-cgroup, config options
in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are
enhanced to support the new module dependency.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:36 -08:00
Ben Blum
cf5d5941fd cgroups: subsystem module unloading
Provides support for unloading modular subsystems.

This patch adds a new function cgroup_unload_subsys which is to be used
for removing a loaded subsystem during module deletion.  Reference
counting of the subsystems' modules is moved from once (at load time) to
once per attached hierarchy (in parse_cgroupfs_options and
rebind_subsystems) (i.e., 0 or 1).

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:36 -08:00
Ben Blum
e6a1105ba0 cgroups: subsystem module loading interface
Add interface between cgroups subsystem management and module loading

This patch implements rudimentary module-loading support for cgroups -
namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a
module initcall, and a struct module pointer in struct cgroup_subsys.

Several functions that might be wanted by modules have had EXPORT_SYMBOL
added to them, but it's unclear exactly which functions want it and which
won't.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:36 -08:00
Ben Blum
aae8aab403 cgroups: revamp subsys array
This patch series provides the ability for cgroup subsystems to be
compiled as modules both within and outside the kernel tree.  This is
mainly useful for classifiers and subsystems that hook into components
that are already modules.  cls_cgroup and blkio-cgroup serve as the
example use cases for this feature.

It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
which modular subsystems can use to register and depart during runtime.
The net_cls classifier subsystem serves as the example for a subsystem
which can be converted into a module using these changes.

Patch #1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear.  Iterations over the array are
modified to handle when subsystems are absent, and the dynamic section of
the array is protected by cgroup_mutex.

Patch #2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.

Patch #3 adds a mechanism for unloading modular subsystems, which includes
a more advanced rework of the rudimentary reference counting introduced in
patch 2.

Patch #4 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.

Part of implementing patches 2 and 4 involved updating css pointers in
each css_set when the module appears or leaves.  In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy).  The subsystem loading and
unloading code therefore should keep in mind the special cases where the
added subsystem is the only one in the dummy cgroup (and therefore all
css_sets need to be linked back into it) and where the removed subsys was
the only one in the dummy cgroup (and therefore all css_sets should be
unlinked from it) - however, as all css_sets always stay attached to the
dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
issue should also make sure these cases are addressed in the subsystem
loading and unloading code.

This patch:

Make subsys[] able to be dynamically populated to support modular
subsystems

This patch reworks the way the subsys[] array is used so that subsystems
can register themselves after boot time, and enables the internals of
cgroups to be able to handle when subsystems are not present or may
appear/disappear.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:36 -08:00
Daisuke Nishimura
d7b9fff711 cgroup: introduce coalesce css_get() and css_put()
Current css_get() and css_put() increment/decrement css->refcnt one by
one.

This patch add a new function __css_get(), which takes "count" as a arg
and increment the css->refcnt by "count".  And this patch also add a new
arg("count") to __css_put() and change the function to decrement the
css->refcnt by "count".

These coalesce version of __css_get()/__css_put() will be used to improve
performance of memcg's moving charge feature later, where instead of
calling css_get()/css_put() repeatedly, these new functions will be used.

No change is needed for current users of css_get()/css_put().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:36 -08:00
Daisuke Nishimura
2468c7234b cgroup: introduce cancel_attach()
Add cancel_attach() operation to struct cgroup_subsys.  cancel_attach()
can be used when can_attach() operation prepares something for the subsys,
but we should rollback what can_attach() operation has prepared if attach
task fails after we've succeeded in can_attach().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:35 -08:00
Christoph Hellwig
5cacdb4add Add generic sys_olduname()
Add generic implementations of the old and really old uname system calls.
Note that sh only implements sys_olduname but not sys_oldolduname, but I'm
not going to bother with another ifdef for that special case.

m32r implemented an old uname but never wired it up, so kill it, too.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:32 -08:00
Christoph Hellwig
e28cbf2293 improve sys_newuname() for compat architectures
On an architecture that supports 32-bit compat we need to override the
reported machine in uname with the 32-bit value.  Instead of doing this
separately in every architecture introduce a COMPAT_UTS_MACHINE define in
<asm/compat.h> and apply it directly in sys_newuname().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:32 -08:00
Christoph Hellwig
baed7fc9b5 Add generic sys_ipc wrapper
Add a generic implementation of the ipc demultiplexer syscall.  Except for
s390 and sparc64 all implementations of the sys_ipc are nearly identical.

There are slight differences in the types of the parameters, where mips
and powerpc as the only 64-bit architectures with sys_ipc use unsigned
long for the "third" argument as it gets casted to a pointer later, while
it traditionally is an "int" like most other paramters.  frv goes even
further and uses unsigned long for all parameters execept for "ptr" which
is a pointer type everywhere.  The change from int to unsigned long for
"third" and back to "int" for the others on frv should be fine due to the
in-register calling conventions for syscalls (we already had a similar
issue with the generic sys_ptrace), but I'd prefer to have the arch
maintainers looks over this in details.

Except for that h8300, m68k and m68knommu lack an impplementation of the
semtimedop sub call which this patch adds, and various architectures have
gets used - at least on i386 it seems superflous as the compat code on
x86-64 and ia64 doesn't even bother to implement it.

[akpm@linux-foundation.org: add sys_ipc to sys_ni.c]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:32 -08:00
Thomas Gleixner
802702e0c2 timer: Try to survive timer callback preempt_count leak
If a timer callback leaks preempt_count we currently assert a
BUG(). That makes it unnecessarily hard to retrieve information about
the problem especially on laptops and headless stations.

There is a decent chance to survive the preempt_count leak by
restoring the preempt_count to the value before the callback. That
allows in many cases to get valuable information about the root cause
of the problem.

We carried that fixup in preempt-rt for years and were able to decode
such wreckage quite a few times.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linux Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Veen <arjan@infradead.org>
2010-03-12 22:40:44 +01:00
Thomas Gleixner
576da126a6 timer: Split out timer function call
The ident level is starting to be annoying. More white space than
actual code. Split out the timer function call into its own function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-12 22:40:43 +01:00
Uwe Kleine-König
06f71b922c timer: Print function name for timer callbacks modifying preemption count
A function scheduled with a timer must not exit with a different
preempt count than it was entered. To make helping users running into
the corresponding BUG() easier also print the name of the bad function
not only its address.

[ tglx: Sanitized printk ]

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: johnstul@us.ibm.com
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-12 22:40:42 +01:00
John Stultz
64ce4c2f52 time: Clean up warp_clock()
warp_clock() currently accesses timekeeping internal state directly, which
is unnecessary.  Convert it to use the proper timekeeping interfaces.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-12 22:40:42 +01:00
Stanislaw Gruszka
c28739375b cpu-timers: Avoid iterating over all threads in fastpath_timer_check()
Spread p->sighand->siglock locking scope to make sure that
fastpath_timer_check() never iterates over all threads. Without
locking there is small possibility that signal->cputimer will stop
running while we write values to signal->cputime_expires.

Calling thread_group_cputime() from fastpath_timer_check() is not only
bad because it is slow, also it is racy with __exit_signal() which can
lead to invalid signal->{s,u}time values.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-12 22:40:41 +01:00
Stanislaw Gruszka
1f169f84d2 cpu-timers: Change SIGEV_NONE timer implementation
When user sets up a timer without associated signal and process does
not use any other cpu timers and does not exit, tsk->signal->cputimer
is enabled and running forever.

Avoid running the timer for no reason.

I used below program to check patch does not break current user space
visible behavior.

 #include <sys/time.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <time.h>
 #include <unistd.h>
 #include <assert.h>

 void consume_cpu(void)
 {
	int i = 0;
	int count = 0;

	for(i=0; i<100000000; i++)
		count++;
 }

 int main(void)
 {
	int i;
	struct sigaction act;
	struct sigevent evt = { };
	timer_t tid;
	struct itimerspec spec = { };

	evt.sigev_notify = SIGEV_NONE;
	assert(timer_create(CLOCK_PROCESS_CPUTIME_ID, &evt,  &tid) == 0);

	spec.it_value.tv_sec = 10;
	assert(timer_settime(tid, 0, &spec,  NULL) == 0);

	for (i = 0; i < 30; i++) {
		consume_cpu();
		memset(&spec, 0, sizeof(spec));
		assert(timer_gettime(tid, &spec) == 0);
		printf("%lu.%09lu\n",
			(unsigned long) spec.it_value.tv_sec,
			(unsigned long) spec.it_value.tv_nsec);
	}

	assert(timer_delete(tid) == 0);
	return 0;
 }

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-12 22:40:40 +01:00
Stanislaw Gruszka
ae1a78eecc cpu-timers: Return correct previous timer reload value
According POSIX we need to correctly set old timer it_interval value when
user request that in timer_settime().  Tested using below program.

 #include <sys/time.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <unistd.h>
 #include <assert.h>

 int main(void)
 {
	struct sigaction act;
	struct sigevent evt = { };
	timer_t tid;
	struct itimerspec spec, u_spec, k_spec;

	evt.sigev_notify = SIGEV_SIGNAL;
	evt.sigev_signo = SIGPROF;
	assert(timer_create(CLOCK_PROCESS_CPUTIME_ID, &evt,  &tid) == 0);

	spec.it_value.tv_sec = 1;
	spec.it_value.tv_nsec = 2;
	spec.it_interval.tv_sec = 3;
	spec.it_interval.tv_nsec = 4;
	u_spec = spec;
	assert(timer_settime(tid, 0, &spec,  NULL) == 0);

	spec.it_value.tv_sec = 5;
	spec.it_value.tv_nsec = 6;
	spec.it_interval.tv_sec = 7;
	spec.it_interval.tv_nsec = 8;
	assert(timer_settime(tid, 0, &spec,  &k_spec) == 0);

 #define PRT(val) printf(#val ":\t%d/%d\n", (int) u_spec.val, (int) k_spec.val)
	PRT(it_value.tv_sec);
	PRT(it_value.tv_nsec);
	PRT(it_interval.tv_sec);
	PRT(it_interval.tv_nsec);

	return 0;
 }

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-12 22:40:40 +01:00
Stanislaw Gruszka
5eb9aa6414 cpu-timers: Cleanup arm_timer()
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-12 22:40:39 +01:00
Stanislaw Gruszka
f55db60904 cpu-timers: Simplify RLIMIT_CPU handling
Let always set signal->cputime_expires expiration cache when setting
new itimer, POSIX 1.b timer, and RLIMIT_CPU.  Since we are
initializing prof_exp expiration cache during fork(), this allows to
remove "RLIMIT_CPU != inf" check from fastpath_timer_check() and do
some other cleanups.

Checked against regression using test cases from:
http://marc.info/?l=linux-kernel&m=123749066504641&w=4
http://marc.info/?l=linux-kernel&m=123811277916642&w=2

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-12 22:40:39 +01:00
Stanislaw Gruszka
15365c108e posix-cpu-timers: Reset expire cache when no timer is running
When a process deletes cpu timer or a timer expires we do not clear
the expiration cache sig->cputimer_expires.

As a result the fastpath_timer_check() which prevents us to loop over
all threads in case no timer is active is not working and we run the
slow path needlessly on every tick.

Zero sig->cputimer_expires in stop_process_timers().

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Spencer Candland <spencer@bluehost.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-12 19:12:18 +01:00
Andrew Morton
829b6c1ef4 timer stats: Fix del_timer_sync() and try_to_del_timer_sync()
These functions forgot to run timer_stats_timer_clear_start_info().  It's
unobvious what effect this has and whether it matters much - we won't be
printing it out anyway if the timer's detached.

Untested, just an Ingo trollpatch.

[ Nevertheless correct - tglx ]

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-12 19:11:29 +01:00
Thomas Gleixner
80a05b9ffa clockevents: Sanitize min_delta_ns adjustment and prevent overflows
The current logic which handles clock events programming failures can
increase min_delta_ns unlimited and even can cause overflows.

Sanitize it by:
 - prevent zero increase when min_delta_ns == 1
 - limiting min_delta_ns to a jiffie
 - bail out if the jiffie limit is hit
 - add retries stats for /proc/timer_list so we can gather data

Reported-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-12 19:10:29 +01:00
Ingo Molnar
937779db13 Merge branch 'perf/urgent' into perf/core
Merge reason: We want to queue up a dependent patch.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-12 10:20:59 +01:00
Mike Galbraith
beac4c7e4a sched: Remove AFFINE_WAKEUPS feature
Disabling affine wakeups is too horrible to contemplate.  Remove the feature flag.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301890.6785.50.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 18:32:53 +01:00
Mike Galbraith
13814d42e4 sched: Remove ASYM_GRAN feature
This features has been enabled for quite a while, after testing showed that
easing preemption for light tasks was harmful to high priority threads.

Remove the feature flag.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301675.6785.44.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 18:32:53 +01:00
Mike Galbraith
c6ee36c423 sched: Remove SYNC_WAKEUPS feature
Sync wakeups are critical functionality with a long history.  Remove it, we don't
need the branch or icache footprint.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301817.6785.47.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 18:32:53 +01:00
Mike Galbraith
f2e74eeac0 sched: Remove WAKEUP_SYNC feature
This feature never earned its keep, remove it.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301591.6785.42.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 18:32:52 +01:00
Mike Galbraith
5ca9880c6f sched: Remove FAIR_SLEEPERS feature
Our preemption model relies too heavily on sleeper fairness to disable it
without dire consequences.  Remove the feature, and save a branch or two.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301520.6785.40.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 18:32:52 +01:00
Mike Galbraith
6bc6cf2b61 sched: Remove NORMALIZED_SLEEPER
This feature hasn't been enabled in a long time, remove effectively dead code.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301447.6785.38.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 18:32:52 +01:00
Mike Galbraith
8b911acdf0 sched: Fix select_idle_sibling()
Don't bother with selection when the current cpu is idle.  Recent load
balancing changes also make it no longer necessary to check wake_affine()
success before returning the selected sibling, so we now always use it.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301369.6785.36.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 18:32:51 +01:00
Mike Galbraith
21406928af sched: Tweak sched_latency and min_granularity
Allow LAST_BUDDY to kick in sooner, improving cache utilization as soon as
a second buddy pair arrives on scene.  The cost is latency starting to climb
sooner, the tbenefit for tbench 8 on my Q6600 box is ~2%.  No detrimental
effects noted in normal idesktop usage.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301285.6785.34.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 18:32:51 +01:00
Mike Galbraith
a64692a3af sched: Cleanup/optimize clock updates
Now that we no longer depend on the clock being updated prior to enqueueing
on migratory wakeup, we can clean up a bit, placing calls to update_rq_clock()
exactly where they are needed, ie on enqueue, dequeue and schedule events.

In the case of a freshly enqueued task immediately preempting, we can skip the
update during preemption, as the clock was just updated by the enqueue event.
We also save an unneeded call during a migratory wakeup by not updating the
previous runqueue, where update_curr() won't be invoked.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301199.6785.32.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 18:32:50 +01:00
Mike Galbraith
e12f31d3e5 sched: Remove avg_overlap
Both avg_overlap and avg_wakeup had an inherent problem in that their accuracy
was detrimentally affected by cross-cpu wakeups, this because we are missing
the necessary call to update_curr().  This can't be fixed without increasing
overhead in our already too fat fastpath.

Additionally, with recent load balancing changes making us prefer to place tasks
in an idle cache domain (which is good for compute bound loads), communicating
tasks suffer when a sync wakeup, which would enable affine placement, is turned
into a non-sync wakeup by SYNC_LESS.  With one task on the runqueue, wake_affine()
rejects the affine wakeup request, leaving the unfortunate where placed, taking
frequent cache misses.

Remove it, and recover some fastpath cycles.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301121.6785.30.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 18:32:50 +01:00
Mike Galbraith
b42e0c41a4 sched: Remove avg_wakeup
Testing the load which led to this heuristic (nfs4 kbuild) shows that it has
outlived it's usefullness.  With intervening load balancing changes, I cannot
see any difference with/without, so recover there fastpath cycles.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301062.6785.29.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 18:32:50 +01:00
Mike Galbraith
39c0cbe215 sched: Rate-limit nohz
Entering nohz code on every micro-idle is costing ~10% throughput for netperf
TCP_RR when scheduling cross-cpu.  Rate limiting entry fixes this, but raises
ticks a bit.  On my Q6600, an idle box goes from ~85 interrupts/sec to 128.

The higher the context switch rate, the more nohz entry costs.  With this patch
and some cycle recovery patches in my tree, max cross cpu context switch rate is
improved by ~16%, a large portion of which of which is this ratelimiting.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268301003.6785.28.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 18:32:49 +01:00
eranian@google.com
9b33fa6ba0 perf_events: Improve task_sched_in()
This patch is an optimization in perf_event_task_sched_in() to avoid
scheduling the events twice in a row.

Without it, the perf_disable()/perf_enable() pair is invoked twice,
thereby pinned events counts while scheduling flexible events and we go
throuh hw_perf_enable() twice.

By encapsulating, the whole sequence into perf_disable()/perf_enable() we
ensure, hw_perf_enable() is going to be invoked only once because of the
refcount protection.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268288765-5326-1-git-send-email-eranian@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 15:23:28 +01:00
Lucas De Marchi
41acab8851 sched: Implement group scheduler statistics in one struct
Put all statistic fields of sched_entity in one struct, sched_statistics,
and embed it into sched_entity.

This change allows to memset the sched_statistics to 0 when needed (for
instance when forking), avoiding bugs of non initialized fields.

Signed-off-by: Lucas De Marchi <lucas.de.marchi@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268275065-18542-1-git-send-email-lucas.de.marchi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 15:22:28 +01:00
Peter Zijlstra
3d07467b7a sched: Fix pick_next_highest_task_rt() for cgroups
Since pick_next_highest_task_rt() already iterates all the cgroups and
is really only interested in tasks, skip over the !task entries.

Reported-by: Dhaval Giani <dhaval.giani@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Dhaval Giani <dhaval.giani@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 15:21:50 +01:00
Xiao Guangrong
639fe4b12f perf: export perf_trace_regs and perf_arch_fetch_caller_regs
Export perf_trace_regs and perf_arch_fetch_caller_regs since module will
use these.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
[ use EXPORT_PER_CPU_SYMBOL_GPL() ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4B989C1B.2090407@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 15:21:29 +01:00
Masami Hiramatsu
83ff56f46a kprobes: Calculate the index correctly when freeing the out-of-line execution slot
From : Ananth N Mavinakayanahalli <ananth@in.ibm.com>

When freeing the instruction slot, the arithmetic to calculate
the index of the slot in the page needs to account for the total
size of the instruction on the various architectures.

Calculate the index correctly when freeing the out-of-line
execution slot.

Reported-by: Sachin Sant <sachinp@in.ibm.com>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
LKML-Reference: <4B9667AB.9050507@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 14:06:16 +01:00
Dan Carpenter
ab3b3aa5dd sched: Cleanup: remove unused variable in try_to_wake_up()
We haven't used the "orig_rq" variable since
055a00865d "Fix/add missing update_rq_clock() calls"

Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: efault@gmx.de
LKML-Reference: <20100306111752.GL4958@bicker>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 13:59:59 +01:00
Ingo Molnar
915a0b575f Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent 2010-03-11 13:39:33 +01:00
Paul E. McKenney
007b09243b rcu: Increase RCU CPU stall timeouts if PROVE_RCU
CONFIG_PROVE_RCU imposes additional overhead on the kernel, so
increase the RCU CPU stall timeouts in an attempt to allow for
this effect.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267830207-9474-2-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 13:38:01 +01:00
Paul E. McKenney
3f379b03fb ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
Replace the calls to read_barrier_depends() in
ftrace_list_func() with rcu_dereference_raw() to improve
readability.  The reason that we use rcu_dereference_raw() here
is that removed entries are never freed, instead they are simply
leaked.  This is one of a very few cases where use of
rcu_dereference_raw() is the long-term right answer.  And I
don't yet know of any others.  ;-)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267830207-9474-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 13:38:01 +01:00
Paul Mackerras
220b140b52 perf_event: Fix oops triggered by cpu offline/online
Anton Blanchard found that he could reliably make the kernel hit a
BUG_ON in the slab allocator by taking a cpu offline and then online
while a system-wide perf record session was running.

The reason is that when the cpu comes up, we completely reinitialize
the ctx field of the struct perf_cpu_context for the cpu.  If there is
a system-wide perf record session running, then there will be a struct
perf_event that has a reference to the context, so its refcount will
be 2.  (The perf_event has been removed from the context's group_entry
and event_entry lists by perf_event_exit_cpu(), but that doesn't
remove the perf_event's reference to the context and doesn't decrement
the context's refcount.)

When the cpu comes up, perf_event_init_cpu() gets called, and it calls
__perf_event_init_context() on the cpu's context.  That resets the
refcount to 1.  Then when the perf record session finishes and the
perf_event is closed, the refcount gets decremented to 0 and the
context gets kfreed after an RCU grace period.  Since the context
wasn't kmalloced -- it's part of a per-cpu variable -- bad things
happen.

In fact we don't need to completely reinitialize the context when the
cpu comes up.  It's sufficient to initialize the context once at boot,
but we need to do it for all possible cpus.

This moves the context initialization to happen at boot time.  With
this, we don't trash the refcount and the context never gets kfreed,
and we don't hit the BUG_ON.

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Anton Blanchard <anton@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-11 12:43:51 +01:00
Thomas Gleixner
0b1adaa031 genirq: Prevent oneshot irq thread race
Lars-Peter pointed out that the oneshot threaded interrupt handler
code has the following race:

 CPU0                            CPU1
 hande_level_irq(irq X)
   mask_ack_irq(irq X)
   handle_IRQ_event(irq X)
     wake_up(thread_handler)
                                 thread handler(irq X) runs
                                 finalize_oneshot(irq X)
				  does not unmask due to 
				  !(desc->status & IRQ_MASKED)

 return from irq
 does not unmask due to
 (desc->status & IRQ_ONESHOT)
  				  
This leaves the interrupt line masked forever. 

The reason for this is the inconsistent handling of the IRQ_MASKED
flag. Instead of setting it in the mask function the oneshot support
sets the flag after waking up the irq thread.

The solution for this is to set/clear the IRQ_MASKED status whenever
we mask/unmask an interrupt line. That's the easy part, but that
cleanup opens another race:

 CPU0                            CPU1
 hande_level_irq(irq)
   mask_ack_irq(irq)
   handle_IRQ_event(irq)
     wake_up(thread_handler)
                                 thread handler(irq) runs
                                 finalize_oneshot_irq(irq)
				  unmask(irq)
     irq triggers again
     handle_level_irq(irq)
       mask_ack_irq(irq)
     return from irq due to IRQ_INPROGRESS				  

 return from irq
 does not unmask due to
 (desc->status & IRQ_ONESHOT)

This requires that we synchronize finalize_oneshot_irq() with the
primary handler. If IRQ_INPROGESS is set we wait until the primary
handler on the other CPU has returned before unmasking the interrupt
line again.

We probably have never seen that problem because it does not happen on
UP and on SMP the irqbalancer protects us by pinning the primary
handler and the thread to the same CPU.

Reported-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
2010-03-10 17:45:14 +01:00
Frederic Weisbecker
97d5a22005 perf: Drop the obsolete profile naming for trace events
Drop the obsolete "profile" naming used by perf for trace events.
Perf can now do more than simple events counting, so generalize
the API naming.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
2010-03-10 14:47:18 +01:00
Frederic Weisbecker
c530665c31 perf: Take a hot regs snapshot for trace events
We are taking a wrong regs snapshot when a trace event triggers.
Either we use get_irq_regs(), which gives us the interrupted
registers if we are in an interrupt, or we use task_pt_regs()
which gives us the state before we entered the kernel, assuming
we are lucky enough to be no kernel thread, in which case
task_pt_regs() returns the initial set of regs when the kernel
thread was started.

What we want is different. We need a hot snapshot of the regs,
so that we can get the instruction pointer to record in the
sample, the frame pointer for the callchain, and some other
things.

Let's use the new perf_fetch_caller_regs() for that.

Comparison with perf record -e lock: -R -a -f -g
Before:

        perf  [kernel]                   [k] __do_softirq
               |
               --- __do_softirq
                  |
                  |--55.16%-- __open
                  |
                   --44.84%-- __write_nocancel

After:

            perf  [kernel]           [k] perf_tp_event
               |
               --- perf_tp_event
                  |
                  |--41.07%-- lock_acquire
                  |          |
                  |          |--39.36%-- _raw_spin_lock
                  |          |          |
                  |          |          |--7.81%-- hrtimer_interrupt
                  |          |          |          smp_apic_timer_interrupt
                  |          |          |          apic_timer_interrupt

The old case was producing unreliable callchains. Now having
right frame and instruction pointers, we have the trace we
want.

Also syscalls and kprobe events already have the right regs,
let's use them instead of wasting a retrieval.

v2: Follow the rename perf_save_regs() -> perf_fetch_caller_regs()

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Archs <linux-arch@vger.kernel.org>
2010-03-10 14:40:38 +01:00
Frederic Weisbecker
5331d7b846 perf: Introduce new perf_fetch_caller_regs() for hot regs snapshot
Events that trigger overflows by interrupting a context can
use get_irq_regs() or task_pt_regs() to retrieve the state
when the event triggered. But this is not the case for some
other class of events like trace events as tracepoints are
executed in the same context than the code that triggered
the event.

It means we need a different api to capture the regs there,
namely we need a hot snapshot to get the most important
informations for perf: the instruction pointer to get the
event origin, the frame pointer for the callchain, the code
segment for user_mode() tests (we always use __KERNEL_CS as
trace events always occur from the kernel) and the eflags
for further purposes.

v2: rename perf_save_regs to perf_fetch_caller_regs as per
Masami's suggestion.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Archs <linux-arch@vger.kernel.org>
2010-03-10 14:39:35 +01:00
Frederic Weisbecker
db2c4c7791 lockdep: Move lock events under lockdep recursion protection
There are rcu locked read side areas in the path where we submit
a trace event. And these rcu_read_(un)lock() trigger lock events,
which create recursive events.

One pair in do_perf_sw_event:

__lock_acquire
      |
      |--96.11%-- lock_acquire
      |          |
      |          |--27.21%-- do_perf_sw_event
      |          |          perf_tp_event
      |          |          |
      |          |          |--49.62%-- ftrace_profile_lock_release
      |          |          |          lock_release
      |          |          |          |
      |          |          |          |--33.85%-- _raw_spin_unlock

Another pair in perf_output_begin/end:

__lock_acquire
      |--23.40%-- perf_output_begin
      |          |          __perf_event_overflow
      |          |          perf_swevent_overflow
      |          |          perf_swevent_add
      |          |          perf_swevent_ctx_event
      |          |          do_perf_sw_event
      |          |          perf_tp_event
      |          |          |
      |          |          |--55.37%-- ftrace_profile_lock_acquire
      |          |          |          lock_acquire
      |          |          |          |
      |          |          |          |--37.31%-- _raw_spin_lock

The problem is not that much the trace recursion itself, as we have a
recursion protection already (though it's always wasteful to recurse).
But the trace events are outside the lockdep recursion protection, then
each lockdep event triggers a lock trace, which will trigger two
other lockdep events. Here the recursive lock trace event won't
be taken because of the trace recursion, so the recursion stops there
but lockdep will still analyse these new events:

To sum up, for each lockdep events we have:

	lock_*()
	     |
             trace lock_acquire
                  |
                  ----- rcu_read_lock()
                  |          |
                  |          lock_acquire()
                  |          |
                  |          trace_lock_acquire() (stopped)
                  |          |
		  |          lockdep analyze
                  |
                  ----- rcu_read_unlock()
                             |
                             lock_release
                             |
                             trace_lock_release() (stopped)
                             |
                             lockdep analyze

And you can repeat the above two times as we have two rcu read side
sections when we submit an event.

This is fixed in this patch by moving the lock trace event under
the lockdep recursion protection.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
2010-03-10 14:26:07 +01:00
Peter Zijlstra
d4944a0666 perf: Provide better condition for event rotation
Try to avoid useless rotation and PMU disables.

[ Could be improved by keeping a nr_runnable count to better account
  for the < PERF_STAT_INACTIVE counters ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: paulus@samba.org
Cc: eranian@google.com
Cc: robert.richter@amd.com
Cc: fweisbec@gmail.com
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-10 13:22:36 +01:00
Peter Zijlstra
32975a4f11 perf: Optimize perf_disable
Currently we always call hw_perf_disable(), even if its already disabled,
this seems superflous, esp. since it cannot be made NMI safe (see further
patches).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org
Cc: eranian@google.com
Cc: robert.richter@amd.com
Cc: fweisbec@gmail.com
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-10 13:22:25 +01:00
Peter Zijlstra
3f6da39053 perf: Rework and fix the arch CPU-hotplug hooks
Remove the hw_perf_event_*() hotplug hooks in favour of per PMU hotplug
notifiers. This has the advantage of reducing the static weak interface
as well as exposing all hotplug actions to the PMU.

Use this to fix x86 hotplug usage where we did things in ONLINE which
should have been done in UP_PREPARE or STARTING.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: paulus@samba.org
Cc: eranian@google.com
Cc: robert.richter@amd.com
Cc: fweisbec@gmail.com
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
LKML-Reference: <20100305154128.736225361@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-10 13:22:24 +01:00
Peter Zijlstra
dc1d628a67 perf: Provide generic perf_sample_data initialization
This makes it easier to extend perf_sample_data and fixes a bug on arm
and sparc, which failed to set ->raw to NULL, which can cause crashes
when combined with PERF_SAMPLE_RAW.

It also optimizes PowerPC and tracepoint, because the struct
initialization is forced to zero out the whole structure.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jean Pihet <jpihet@mvista.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Jamie Iles <jamie.iles@picochip.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: stable@kernel.org
LKML-Reference: <20100304140100.315416040@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-10 13:22:23 +01:00
Ingo Molnar
548b841669 Merge commit 'v2.6.34-rc1' into perf/urgent
Conflicts:
	tools/perf/util/probe-event.c

Merge reason: Pick up -rc1 and resolve the conflict as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-09 17:11:53 +01:00
Jiri Kosina
318ae2edc3 Merge branch 'for-next' into for-linus
Conflicts:
	Documentation/filesystems/proc.txt
	arch/arm/mach-u300/include/mach/debug-macro.S
	drivers/net/qlge/qlge_ethtool.c
	drivers/net/qlge/qlge_main.c
	drivers/net/typhoon.c
2010-03-08 16:55:37 +01:00
Eric W. Biederman
361795b1eb sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on module dynamic attributes
A little more whack-a-mole annotating the dynamic sysfs attributes.  I
had everything built into my earlier test kernel, and so I missed
these.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-03-07 17:04:51 -08:00
Eric W. Biederman
a07e4156a2 sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on dynamic attributes
These are the non-static sysfs attributes that exist on
my test machine.  Fix them to use sysfs_attr_init or
sysfs_bin_attr_init as appropriate.   It simply requires
making a sysfs attribute present to see this.  So this
is a little bit tedious but otherwise not too bad.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-03-07 17:04:51 -08:00
Emese Revfy
52cf25d0ab Driver core: Constify struct sysfs_ops in struct kobj_type
Constify struct sysfs_ops.

This is part of the ops structure constification
effort started by Arjan van de Ven et al.

Benefits of this constification:

 * prevents modification of data that is shared
   (referenced) by many other structure instances
   at runtime

 * detects/prevents accidental (but not intentional)
   modification attempts on archs that enforce
   read-only kernel data at runtime

 * potentially better optimized code as the compiler
   can assume that the const data cannot be changed

 * the compiler/linker move const data into .rodata
   and therefore exclude them from false sharing

Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-03-07 17:04:49 -08:00
Emese Revfy
9cd43611cc kobject: Constify struct kset_uevent_ops
Constify struct kset_uevent_ops.

This is part of the ops structure constification
effort started by Arjan van de Ven et al.

Benefits of this constification:

 * prevents modification of data that is shared
   (referenced) by many other structure instances
   at runtime

 * detects/prevents accidental (but not intentional)
   modification attempts on archs that enforce
   read-only kernel data at runtime

 * potentially better optimized code as the compiler
   can assume that the const data cannot be changed

 * the compiler/linker move const data into .rodata
   and therefore exclude them from false sharing

Signed-off-by: Emese Revfy <re.emese@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-03-07 17:04:49 -08:00
Andi Kleen
c9be0a36f9 sysdev: Pass attribute in sysdev_class attributes show/store
Passing the attribute to the low level IO functions allows all kinds
of cleanups, by sharing low level IO code without requiring
an own function for every piece of data.

Also drivers can extend the attributes with own data fields
and use that in the low level function.

Similar to sysdev_attributes and normal attributes.

This is a tree-wide sweep, converting everything in one go.

No functional changes in this patch other than passing the new
argument everywhere.

Tested on x86, the non x86 parts are uncompiled.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-03-07 17:04:47 -08:00
Daisuke HATAYAMA
8d9032bbe4 elf coredump: add extended numbering support
The current ELF dumper implementation can produce broken corefiles if
program headers exceed 65535.  This number is determined by the number of
vmas which the process have.  In particular, some extreme programs may use
more than 65535 vmas.  (If you google max_map_count, you can find some
users facing this problem.) This kind of program never be able to generate
correct coredumps.

This patch implements ``extended numbering'' that uses sh_info field of
the first section header instead of e_phnum field in order to represent
upto 4294967295 vmas.

This is supported by
AMD64-ABI(http://www.x86-64.org/documentation.html) and
Solaris(http://docs.sun.com/app/docs/doc/817-1984/).
Of course, we are preparing patches for gdb and binutils.

Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:46 -08:00
Daisuke HATAYAMA
1fcccbac89 elf coredump: replace ELF_CORE_EXTRA_* macros by functions
elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the corresponding
macro for hiding _multiline_ logics in functions.  This patch removes
#ifdef and replaces ELF_CORE_EXTRA_* by corresponding functions.  For
architectures not implemeonting ELF_CORE_EXTRA_*, we use weak functions in
order to reduce a range of modification.

This cleanup is for my next patches, but I think this cleanup itself is
worth doing regardless of my firnal purpose.

Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:45 -08:00
Gustavo F. Padovan
cea83886dd printk: avoid warning when CONFIG_PRINTK is disabled
kernel/printk.c:72: warning: `saved_console_loglevel' defined but not used

Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:33 -08:00
Tetsuo Handa
9728e5d6e6 kernel/pid.c: update comment on find_task_by_pid_ns
tasklist_lock does protect the task and its pid, it can't go away.  The
problem is that find_pid_ns() itself is unsafe without rcu lock, it can
race with copy_process()->free_pid(any_pid).

Protecting copy_process()->free_pid(any_pid) with tasklist_lock would make
it possible to call find_task_by_pid_ns() under tasklist safely, but we
don't do so because we are trying to get rid of the read_lock sites of
tasklist_lock.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:33 -08:00
Anton Blanchard
8aeee85a29 panic: fix panic_timeout accuracy when running on a hypervisor
I've had some complaints about panic_timeout being wildly innacurate on
shared processor PowerPC partitions (a 3 minute panic_timeout taking 30
minutes).

The problem is we loop on mdelay(1) and with a 1ms in 10ms hypervisor
timeslice each of these will take 10ms (ie 10x) longer.  I expect other
platforms with shared processor hypervisors will see the same issue.

This patch keeps the old behaviour if we have a panic_blink (only keyboard
LEDs right now) and does 1 second mdelays if we don't.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:33 -08:00
Jiri Slaby
78d7d407b6 kernel core: use helpers for rlimits
Make sure compiler won't do weird things with limits.  E.g.  fetching them
twice may return 2 different values after writable limits are implemented.

I.e.  either use rlimit helpers added in commit 3e10e716ab ("resource:
add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:33 -08:00
Jiri Slaby
d4bb527438 posix-cpu-timers: cleanup rlimits usage
Fetch rlimit (both hard and soft) values only once and work on them.  It
removes many accesses through sig structure and makes the code cleaner.

Mostly a preparation for writable resource limits support.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:32 -08:00
Thiago Farina
f3abd4f953 kernel/exit.c: fix shadows sparse warning
kernel/exit.c:1183:26: warning: symbol 'status' shadows an earlier one
kernel/exit.c:1173:21: originally declared here

Signed-off-by: Thiago Farina <tfransosi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:32 -08:00
Jaswinder Singh Rajput
9c03c38356 includecheck fix for kernel/params.c
Fix the following 'make includecheck' warning:
  kernel/params.c: linux/string.h is included more than once.

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:32 -08:00
Dan Carpenter
5f1664f92b splice: comparing unsigned int < 0
"ret" needs to be signed or the error handling for splice_to_pipe() won't
work correctly.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: Tom Zanussi <zanussi@comcast.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:32 -08:00
Chen Gong
87d5e0236d kernel/cpu.c: delete deprecated definition in cpu_up()
Additional_cpus is only supported for IA64 now.  X86_64 should not be
included.

Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:28 -08:00
Rafael J. Wysocki
452aa6999e mm/pm: force GFP_NOIO during suspend/hibernation and resume
There are quite a few GFP_KERNEL memory allocations made during
suspend/hibernation and resume that may cause the system to hang, because
the I/O operations they depend on cannot be completed due to the
underlying devices being suspended.

Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
gfp_allowed_mask before suspend/hibernation and restoring the original
values of these bits in gfp_allowed_mask durig the subsequent resume.

[akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-by: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:26 -08:00
Rik van Riel
5beb493052 mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads.  Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes.  However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock.  This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands.  Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA.  At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated.  The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.

This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
 This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations.  This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures.  This in
turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock.  To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time.  The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:26 -08:00
KAMEZAWA Hiroyuki
34e55232e5 mm: avoid false sharing of mm_counter
Considering the nature of per mm stats, it's the shared object among
threads and can be a cache-miss point in the page fault path.

This patch adds per-thread cache for mm_counter.  RSS value will be
counted into a struct in task_struct and synchronized with mm's one at
events.

Now, in this patch, the event is the number of calls to handle_mm_fault.
Per-thread value is added to mm at each 64 calls.

 rough estimation with small benchmark on parallel thread (2threads) shows
 [before]
     4.5 cache-miss/faults
 [after]
     4.0 cache-miss/faults
 Anyway, the most contended object is mmap_sem if the number of threads grows.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:24 -08:00
KAMEZAWA Hiroyuki
d559db086f mm: clean up mm_counter
Presently, per-mm statistics counter is defined by macro in sched.h

This patch modifies it to
  - defined in mm.h as inlinf functions
  - use array instead of macro's name creation.

This patch is for reducing patch size in future patch to modify
implementation of per-mm counter.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:23 -08:00
Akinobu Mita
984b3f5746 bitops: rename for_each_bit() to for_each_set_bit()
Rename for_each_bit to for_each_set_bit in the kernel source tree.  To
permit for_each_clear_bit(), should that ever be added.

The patch includes a macro to map the old for_each_bit() onto the new
for_each_set_bit().  This is a (very) temporary thing to ease the migration.

[akpm@linux-foundation.org: add temporary for_each_bit()]
Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:23 -08:00
Tim Bird
0e95017355 function-graph: Add tracing_thresh support to function_graph tracer
Add support for tracing_thresh to the function_graph tracer.  This
version of this feature isolates the checks into new entry and
return functions, to avoid adding more conditional code into the
main function_graph paths.

When the tracing_thresh is set and the function graph tracer is
enabled, only the functions that took longer than the time in
microseconds that was set in tracing_thresh are recorded. To do this
efficiently, only the function exits are recorded:

 [tracing]# echo 100 > tracing_thresh
 [tracing]# echo function_graph > current_tracer
 [tracing]# cat trace
 # tracer: function_graph
 #
 # CPU  DURATION                  FUNCTION CALLS
 # |     |   |                     |   |   |   |
  1) ! 119.214 us  |  } /* smp_apic_timer_interrupt */
  1)   <========== |
  0) ! 101.527 us  |              } /* __rcu_process_callbacks */
  0) ! 126.461 us  |            } /* rcu_process_callbacks */
  0) ! 145.111 us  |          } /* __do_softirq */
  0) ! 149.667 us  |        } /* do_softirq */
  0) ! 168.817 us  |      } /* irq_exit */
  0) ! 248.254 us  |    } /* smp_apic_timer_interrupt */

Also, add support for specifying tracing_thresh on the kernel
command line.  When used like so: "tracing_thresh=200 ftrace=function_graph"
this can be used to analyse system startup.  It is important to disable
tracing soon after boot, in order to avoid losing the trace data.

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Tim Bird <tim.bird@am.sony.com>
LKML-Reference: <4B87098B.4040308@am.sony.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-05 21:20:57 -05:00
Arnaldo Carvalho de Melo
1acaa1b2d9 tracing: Update the comm field in the right variable in update_max_tr
The latency output showed:

 #    | task: -3 (uid:0 nice:0 policy:1 rt_prio:99)

The comm is missing in the "task:" and it looks like a minus 3 is
the output. The correct display should be:

 #    | task: migration/0-3 (uid:0 nice:0 policy:1 rt_prio:99)

The problem is that the comm is being stored in the wrong data
structure. The max_tr.data[cpu] is what stores the comm, not the
tr->data[cpu].

Before this patch the max_tr.data[cpu]->comm was zeroed and the /debug/trace
ended up showing just the '-' sign followed by the pid.

Also remove a needless initialization of max_data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <1267824230-23861-1-git-send-email-acme@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-05 21:12:08 -05:00
Steven Rostedt
a094fe04c7 function-graph: Use comment notation for func names of dangling '}'
When a '}' does not have a matching function start, the name is printed
within parenthesis. But this makes it confusing between ending '}'
and function starts. This patch makes the function name appear in C comment
notation.

Old view:
 3)   1.281 us    |            } (might_fault)
 3)   3.620 us    |          } (filldir)
 3)   5.251 us    |        } (call_filldir)
 3)               |        call_filldir() {
 3)               |          filldir() {

New view:
 3)   1.281 us    |            } /* might_fault */
 3)   3.620 us    |          } /* filldir */
 3)   5.251 us    |        } /* call_filldir */
 3)               |        call_filldir() {
 3)               |          filldir() {

Requested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-05 21:11:13 -05:00
Steven Rostedt
801c29fd1f function-graph: Fix unused reference to ftrace_set_func()
The declaration of ftrace_set_func() is at the start of the ftrace.c file
and wrapped with a #ifdef CONFIG_FUNCTION_GRAPH condition. If function
graph tracing is enabled but CONFIG_DYNAMIC_FTRACE is not, a warning
about that function being declared static and unused is given.

This really should have been placed within the CONFIG_FUNCTION_GRAPH
condition that uses ftrace_set_func().

Moving the declaration down fixes the warning and makes the code cleaner.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-05 21:00:30 -05:00
Linus Torvalds
660f6a360b Merge branch 'perf-probes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-probes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Issue at least one memory barrier in stop_machine_text_poke()
  perf probe: Correct probe syntax on command line help
  perf probe: Add lazy line matching support
  perf probe: Show more lines after last line
  perf probe: Check function address range strictly in line finder
  perf probe: Use libdw callback routines
  perf probe: Use elfutils-libdw for analyzing debuginfo
  perf probe: Rename probe finder functions
  perf probe: Fix bugs in line range finder
  perf probe: Update perf probe document
  perf probe: Do not show --line option without dwarf support
  kprobes: Add documents of jump optimization
  kprobes/x86: Support kprobes jump optimization on x86
  x86: Add text_poke_smp for SMP cross modifying code
  kprobes/x86: Cleanup save/restore registers
  kprobes/x86: Boost probes when reentering
  kprobes: Jump optimization sysctl interface
  kprobes: Introduce kprobes jump optimization
  kprobes: Introduce generic insn_slot framework
  kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE
2010-03-05 10:50:22 -08:00
Linus Torvalds
586fac13f8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  padata: Allocate the cpumask for the padata instance
  crypto: authenc - Move saved IV in front of the ablkcipher request
  crypto: hash - Fix handling of unaligned buffers
  crypto: authenc - Use correct ahash complete functions
  crypto: md5 - Set statesize
2010-03-05 10:47:57 -08:00
Linus Torvalds
0f2cc4ecd8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
  init: Open /dev/console from rootfs
  mqueue: fix typo "failues" -> "failures"
  mqueue: only set error codes if they are really necessary
  mqueue: simplify do_open() error handling
  mqueue: apply mathematics distributivity on mq_bytes calculation
  mqueue: remove unneeded info->messages initialization
  mqueue: fix mq_open() file descriptor leak on user-space processes
  fix race in d_splice_alias()
  set S_DEAD on unlink() and non-directory rename() victims
  vfs: add NOFOLLOW flag to umount(2)
  get rid of ->mnt_parent in tomoyo/realpath
  hppfs can use existing proc_mnt, no need for do_kern_mount() in there
  Mirror MS_KERNMOUNT in ->mnt_flags
  get rid of useless vfsmount_lock use in put_mnt_ns()
  Take vfsmount_lock to fs/internal.h
  get rid of insanity with namespace roots in tomoyo
  take check for new events in namespace (guts of mounts_poll()) to namespace.c
  Don't mess with generic_permission() under ->d_lock in hpfs
  sanitize const/signedness for udf
  nilfs: sanitize const/signedness in dealing with ->d_name.name
  ...

Fix up fairly trivial (famous last words...) conflicts in
drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
2010-03-04 08:15:33 -08:00
Paul E. McKenney
8d53dd546f rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
Change the pair of rcu_dereference() calls in
ftrace_perf_buf_prepare() to rcu_dereference_sched().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1267667418-32233-3-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-04 12:07:35 +01:00
Paul E. McKenney
cc5b83a9f8 rcu: Add control variables to lockdep_rcu_dereference() diagnostics
Add the values of rcu_scheduler_active() and debug_locks() to
the lockdep_rcu_dereference() output to help diagnose RCU
lockdep splats that occur shortly after the scheduler starts.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267631219-8713-4-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-04 12:07:32 +01:00
Ingo Molnar
e02c4fd314 Merge branch 'tip/tracing/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent 2010-03-04 11:51:29 +01:00
Ingo Molnar
4f16d4e0c9 Merge branch 'perf/core' into perf/urgent
Merge reason: Switch from pre-merge topical split to the post-merge urgent track

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-04 11:47:52 +01:00
Paul E. McKenney
db1466b3e1 rcu: Use wrapper function instead of exporting tasklist_lock
Lockdep-RCU commit d11c563d exported tasklist_lock, which is not
a good thing.  This patch instead exports a function that uses
lockdep to check whether tasklist_lock is held.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: Christoph Hellwig <hch@lst.de>
LKML-Reference: <1267631219-8713-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-04 11:46:14 +01:00
Ingo Molnar
0e064caf64 Merge branches 'core/futexes' and 'core/iommu' into core/urgent
Merge reason: Switch from topical split to the stabilization track

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-04 11:45:31 +01:00
Steffen Klassert
7478138782 padata: Allocate the cpumask for the padata instance
The cpumask of the padata instance was used without allocated.
This caused boot crashes if CONFIG_CPUMASK_OFFSTACK is enabled.
This patch fixes this by doing proper allocation for this cpumask.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-03-04 13:30:22 +08:00
Linus Torvalds
a27341cd5f Prioritize synchronous signals over 'normal' signals
This makes sure that we pick the synchronous signals caused by a
processor fault over any pending regular asynchronous signals sent to
use by [t]kill().

This is not strictly required semantics, but it makes it _much_ easier
for programs like Wine that expect to find the fault information in the
signal stack.

Without this, if a non-synchronous signal gets picked first, the delayed
asynchronous signal will have its signal context pointing to the new
signal invocation, rather than the instruction that caused the SIGSEGV
or SIGBUS in the first place.

This is not all that pretty, and we're discussing making the synchronous
signals more explicit rather than have these kinds of implicit
preferences of SIGSEGV and friends.  See for example

	http://bugzilla.kernel.org/show_bug.cgi?id=15395

for some of the discussion.  But in the meantime this is a simple and
fairly straightforward work-around, and the whole

	if (x & Y)
		x &= Y;

thing can be compiled into (and gcc does do it) just three instructions:

	movq    %rdx, %rax
	andl    $Y, %eax
	cmovne  %rax, %rdx

so it is at least a simple solution to a subtle issue.

Reported-and-tested-by: Pavel Vilim <wylda@volny.cz>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-03 19:21:10 -08:00
Al Viro
9643f5d94a Merge branch 'for-fsnotify' into for-linus 2010-03-03 17:12:40 -05:00
Al Viro
1f707137b5 new helper: iterate_mounts()
apply function to vfsmounts in set returned by collect_mounts(),
stop if it returns non-zero.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:57 -05:00
Al Viro
2096f759ab New helper: path_is_under(path1, path2)
Analog of is_subdir for vfsmount,dentry pairs, moved from audit_tree.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:55 -05:00
Al Viro
8737c9305b Switch may_open() and break_lease() to passing O_...
... instead of mixing FMODE_ and O_

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:21 -05:00
Linus Torvalds
2a32f2db13 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  resource: Fix broken indentation
  resource: Fix generic page_is_ram() for partial RAM pages
  x86, paravirt: Remove kmap_atomic_pte paravirt op.
  x86, vmi: Disable highmem PTE allocation even when CONFIG_HIGHPTE=y
  x86, xen: Disable highmem PTE allocation even when CONFIG_HIGHPTE=y
2010-03-03 09:11:02 -08:00
Linus Torvalds
fb7b096d94 Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
  x86: Fix out of order of gsi
  x86: apic: Fix mismerge, add arch_probe_nr_irqs() again
  x86, irq: Keep chip_data in create_irq_nr and destroy_irq
  xen: Remove unnecessary arch specific xen irq functions.
  smp: Use nr_cpus= to set nr_cpu_ids early
  x86, irq: Remove arch_probe_nr_irqs
  sparseirq: Use radix_tree instead of ptrs array
  sparseirq: Change irq_desc_ptrs to static
  init: Move radix_tree_init() early
  irq: Remove unnecessary bootmem code
  x86: Add iMac9,1 to pci_reboot_dmi_table
  x86: Convert i8259_lock to raw_spinlock
  x86: Convert nmi_lock to raw_spinlock
  x86: Convert ioapic_lock and vector_lock to raw_spinlock
  x86: Avoid race condition in pci_enable_msix()
  x86: Fix SCI on IOAPIC != 0
  x86, ia32_aout: do not kill argument mapping
  x86, irq: Move __setup_vector_irq() before the first irq enable in cpu online path
  x86, irq: Update the vector domain for legacy irqs handled by io-apic
  x86, irq: Don't block IRQ0_VECTOR..IRQ15_VECTOR's on all cpu's
  ...
2010-03-03 08:15:37 -08:00
Linus Torvalds
a626b46e17 Merge branch 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
  early_res: Need to save the allocation name in drop_range_partial()
  sparsemem: Fix compilation on PowerPC
  early_res: Add free_early_partial()
  x86: Fix non-bootmem compilation on PowerPC
  core: Move early_res from arch/x86 to kernel/
  x86: Add find_fw_memmap_area
  Move round_up/down to kernel.h
  x86: Make 32bit support NO_BOOTMEM
  early_res: Enhance check_and_double_early_res
  x86: Move back find_e820_area to e820.c
  x86: Add find_early_area_size
  x86: Separate early_res related code from e820.c
  x86: Move bios page reserve early to head32/64.c
  sparsemem: Put mem map for one node together.
  sparsemem: Put usemap for one node together
  x86: Make 64 bit use early_res instead of bootmem before slab
  x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA
  x86: Make early_node_mem get mem > 4 GB if possible
  x86: Dynamically increase early_res array size
  x86: Introduce max_early_res and early_res_count
  ...
2010-03-03 08:15:05 -08:00
Linus Torvalds
0a135ba14d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: add __percpu sparse annotations to what's left
  percpu: add __percpu sparse annotations to fs
  percpu: add __percpu sparse annotations to core kernel subsystems
  local_t: Remove leftover local.h
  this_cpu: Remove pageset_notifier
  this_cpu: Page allocator conversion
  percpu, x86: Generic inc / dec percpu instructions
  local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c
  module: Use this_cpu_xx to dynamically allocate counters
  local_t: Remove cpu_local_xx macros
  percpu: refactor the code in pcpu_[de]populate_chunk()
  percpu: remove compile warnings caused by __verify_pcpu_ptr()
  percpu: make accessors check for percpu pointer in sparse
  percpu: add __percpu for sparse.
  percpu: make access macros universal
  percpu: remove per_cpu__ prefix.
2010-03-03 07:34:18 -08:00
Denys Vlasenko
3d9a854c2d Rename .data[.percpu][.XXX] to .data[..percpu][..XXX].
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
2010-03-03 11:26:00 +01:00
Lai Jiangshan
ac91d85456 tracing: Fix warning in s_next of trace file ops
This warning in s_next() can be triggered by lseek():
 [<c018b3f7>] ? s_next+0x77/0x80
 [<c013e3c1>] warn_slowpath_common+0x81/0xa0
 [<c018b3f7>] ? s_next+0x77/0x80
 [<c013e3fa>] warn_slowpath_null+0x1a/0x20
 [<c018b3f7>] s_next+0x77/0x80
 [<c01efa77>] traverse+0x117/0x200
 [<c01eff13>] seq_lseek+0xa3/0x120
 [<c01efe70>] ? seq_lseek+0x0/0x120
 [<c01d7081>] vfs_llseek+0x41/0x50
 [<c01d8116>] sys_llseek+0x66/0xa0
 [<c0102bd0>] sysenter_do_call+0x12/0x26

The iterator "leftover" variable is zeroed in the opening of the trace
file. But lseek can call s_start() which will call s_next() without
reseting the "leftover" variable back to zero, which might trigger
the WARN_ON_ONCE(iter->leftover) that is in s_next().

Cc: stable@kernel.org
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B8CE06A.9090207@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-02 21:11:47 -05:00
Linus Torvalds
832d30ca72 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (38 commits)
  SELinux: Make selinux_kernel_create_files_as() shouldn't just always return 0
  TOMOYO: Protect find_task_by_vpid() with RCU.
  Security: add static to security_ops and default_security_ops variable
  selinux: libsepol: remove dead code in check_avtab_hierarchy_callback()
  TOMOYO: Remove __func__ from tomoyo_is_correct_path/domain
  security: fix a couple of sparse warnings
  TOMOYO: Remove unneeded parameter.
  TOMOYO: Use shorter names.
  TOMOYO: Use enum for index numbers.
  TOMOYO: Add garbage collector.
  TOMOYO: Add refcounter on domain structure.
  TOMOYO: Merge headers.
  TOMOYO: Add refcounter on string data.
  TOMOYO: Reduce lines by using common path for addition and deletion.
  selinux: fix memory leak in sel_make_bools
  TOMOYO: Extract bitfield
  syslog: clean up needless comment
  syslog: use defined constants instead of raw numbers
  syslog: distinguish between /proc/kmsg and syscalls
  selinux: allow MLS->non-MLS and vice versa upon policy reload
  ...
2010-03-02 14:47:24 -08:00
H. Peter Anvin
f41496607e resource: Fix broken indentation
Fix broken indentation in patch
37b99dd537.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
LKML-Reference: <20100301135551.GA9998@localhost>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-03-02 11:21:09 -08:00
Linus Torvalds
b7f3a209e9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6:
  sparc: Support show_unhandled_signals.
  sparc: use __ratelimit
  sunxvr500: Additional PCI id for sunxvr500 driver
  sparc: use asm-generic/scatterlist.h
  sparc64: If 'slot-names' property exist, create sysfs PCI slot information.
  sparc: remove trailing space in messages
  sparc: remove redundant return statements
2010-03-02 07:56:44 -08:00
Linus Torvalds
6d6b89bd2e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1341 commits)
  virtio_net: remove forgotten assignment
  be2net: fix tx completion polling
  sis190: fix cable detect via link status poll
  net: fix protocol sk_buff field
  bridge: Fix build error when IGMP_SNOOPING is not enabled
  bnx2x: Tx barriers and locks
  scm: Only support SCM_RIGHTS on unix domain sockets.
  vhost-net: restart tx poll on sk_sndbuf full
  vhost: fix get_user_pages_fast error handling
  vhost: initialize log eventfd context pointer
  vhost: logging thinko fix
  wireless: convert to use netdev_for_each_mc_addr
  ethtool: do not set some flags, if others failed
  ipoib: returned back addrlen check for mc addresses
  netlink: Adding inode field to /proc/net/netlink
  axnet_cs: add new id
  bridge: Make IGMP snooping depend upon BRIDGE.
  bridge: Add multicast count/interval sysfs entries
  bridge: Add hash elasticity/max sysfs entries
  bridge: Add multicast_snooping sysfs toggle
  ...

Trivial conflicts in Documentation/feature-removal-schedule.txt
2010-03-02 07:55:08 -08:00
Peter Zijlstra
320ebf09cb perf, x86: Restrict the ANY flag
The ANY flag can show SMT data of another task (like 'top'),
so we want to disable it when system-wide profiling is
disabled.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-02 15:06:46 +01:00
Peter Zijlstra
5671a10e2b nmi_watchdog: Tell the world we're active
Because I was wondering why perf stat wasn't working as expected..

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Don Zickus <dzickus@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-02 15:02:05 +01:00
john stultz
ad6759fbf3 timekeeping: Prevent oops when GENERIC_TIME=n
Aaro Koskinen reported an issue in kernel.org bugzilla #15366, where
on non-GENERIC_TIME systems, accessing
/sys/devices/system/clocksource/clocksource0/current_clocksource
results in an oops.

It seems the timekeeper/clocksource rework missed initializing the
curr_clocksource value in the !GENERIC_TIME case.

Thanks to Aaro for reporting and diagnosing the issue as well as
testing the fix!

Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: stable@kernel.org
LKML-Reference: <1267475683.4216.61.camel@localhost.localdomain>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-03-02 09:22:25 +01:00
Yinghai Lu
dce46a04d5 early_res: Need to save the allocation name in drop_range_partial()
During free_early_partial(), reserve_early_without_check() could end
extending the early_res area from __check_and_double_early_res(); as a
result, the location of the name for the current reservation could
change.

Therefore, we need to save a local copy of the name.

[ hpa: rewrote comment and checkin description ]

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4B8C7C94.7070000@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-03-01 23:23:02 -08:00
Wu Fengguang
37b99dd537 resource: Fix generic page_is_ram() for partial RAM pages
The System RAM walk shall skip partial RAM pages and avoid calling
func() on them. So that page_is_ram() return 0 for a partial RAM page.

In particular, it shall not call func() with len=0.
This fixes a boot time bug reported by Sachin and root caused by Thomas:

> >>> WARNING: at arch/x86/mm/ioremap.c:111 __ioremap_caller+0x169/0x2f1()
> >>> Hardware name: BladeCenter LS21 -[79716AA]-
> >>> Modules linked in:
> >>> Pid: 0, comm: swapper Not tainted 2.6.33-git6-autotest #1
> >>> Call Trace:
> >>> [<ffffffff81047cff>] ? __ioremap_caller+0x169/0x2f1
> >>> [<ffffffff81063b7d>] warn_slowpath_common+0x77/0xa4
> >>> [<ffffffff81063bb9>] warn_slowpath_null+0xf/0x11
> >>> [<ffffffff81047cff>] __ioremap_caller+0x169/0x2f1
> >>> [<ffffffff813747a3>] ? acpi_os_map_memory+0x12/0x1b
> >>> [<ffffffff81047f10>] ioremap_nocache+0x12/0x14
> >>> [<ffffffff813747a3>] acpi_os_map_memory+0x12/0x1b
> >>> [<ffffffff81282fa0>] acpi_tb_verify_table+0x29/0x5b
> >>> [<ffffffff812827f0>] acpi_load_tables+0x39/0x15a
> >>> [<ffffffff8191c8f8>] acpi_early_init+0x60/0xf5
> >>> [<ffffffff818f2cad>] start_kernel+0x397/0x3a7
> >>> [<ffffffff818f2295>] x86_64_start_reservations+0xa5/0xa9
> >>> [<ffffffff818f237a>] x86_64_start_kernel+0xe1/0xe8
> >>> ---[ end trace 4eaa2a86a8e2da22 ]---
> >>> ioremap reserve_memtype failed -22

The return code is -EINVAL, so it failed in the is_ram check, which is
not too surprising

> BIOS-provided physical RAM map:
>  BIOS-e820: 0000000000000000 - 000000000009c000 (usable)
>  BIOS-e820: 000000000009c000 - 00000000000a0000 (reserved)
>  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
>  BIOS-e820: 0000000000100000 - 00000000cffa3900 (usable)
>  BIOS-e820: 00000000cffa3900 - 00000000cffa7400 (ACPI data)

The ACPI data is not starting on a page boundary and neither does the
usable RAM area end on a page boundary. Very useful !

> ACPI: DSDT 00000000cffa3900 036CE (v01 IBM    SERLEWIS 00001000 INTL 20060912)

ACPI is trying to map DSDT at cffa3900, which results in a check
vs. cffa3000 which is the relevant page boundary. The generic is_ram
check correctly identifies that as RAM because it's in the usable
resource area. The old e820 based is_ram check does not take
overlapping resource areas into account. That's why it works.

CC: Sachin Sant <sachinp@in.ibm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
LKML-Reference: <20100301135551.GA9998@localhost>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-03-01 10:18:32 -08:00
Linus Torvalds
b1bf936840 Merge branch 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block: (38 commits)
  block: don't access jiffies when initialising io_context
  cfq: remove 8 bytes of padding from cfq_rb_root on 64 bit builds
  block: fix for "Consolidate phys_segment and hw_segment limits"
  cfq-iosched: quantum check tweak
  blktrace: perform cleanup after setup error
  blkdev: fix merge_bvec_fn return value checks
  cfq-iosched: requests "in flight" vs "in driver" clarification
  cciss: Fix problem with scatter gather elements in the scsi half of the driver
  cciss: eliminate unnecessary pointer use in cciss scsi code
  cciss: do not use void pointer for scsi hba data
  cciss: factor out scatter gather chain block mapping code
  cciss: fix scatter gather chain block dma direction kludge
  cciss: simplify scatter gather code
  cciss: factor out scatter gather chain block allocation and freeing
  cciss: detect bad alignment of scsi commands at build time
  cciss: clarify command list padding calculation
  cfq-iosched: rethink seeky detection for SSDs
  cfq-iosched: rework seeky detection
  block: remove padding from io_context on 64bit builds
  block: Consolidate phys_segment and hw_segment limits
  ...
2010-03-01 09:00:29 -08:00
Linus Torvalds
0f45339794 Merge branches 'futexes-for-linus', 'irq-core-for-linus' and 'bkl-drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futex: Protect pid lookup in compat code with RCU

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: Fix documentation of default chip disable()

* 'bkl-drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  nvram: Drop the BKL from nvram_open()
2010-03-01 08:51:52 -08:00
Linus Torvalds
e56425b135 Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  posix-timers.c: Don't export local functions
  clocksource: start CMT at clocksource resume
  clocksource: add suspend callback
  clocksource: add argument to resume callback
  ntp: Cleanup xtime references in ntp.c
  ntp: Make time_esterror and time_maxerror static
2010-03-01 08:48:25 -08:00
Paul E. McKenney
90a6501f94 sched, rcu: Fix rcu_dereference() for RCU-lockdep
Make rcu_dereference() of runqueue data structures be
rcu_dereference_sched().

Located-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <20100228163218.GD6846@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-01 09:29:58 +01:00
Ingo Molnar
e2f4699ac1 Merge branch 'linus' into core/rcu
Merge reason: Backmerge latest upstream to queue up dependent fix in the
              scheduler.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-01 09:28:58 +01:00
David S. Miller
4b17764737 sparc: Support show_unhandled_signals.
Just faults right now, will add other traps later.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-01 00:02:23 -08:00
David S. Miller
47871889c6 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	drivers/firmware/iscsi_ibft.c
2010-02-28 19:23:06 -08:00
James Morris
b4ccebdd37 Merge branch 'next' into for-linus 2010-03-01 09:36:31 +11:00
Frederic Weisbecker
1e259e0a99 hw-breakpoints: Remove stub unthrottle callback
We support event unthrottling in breakpoint events. It means
that if we have more than sysctl_perf_event_sample_rate/HZ,
perf will throttle, ignoring subsequent events until the next
tick.

So if ptrace exceeds this max rate, it will omit events, which
breaks the ptrace determinism that is supposed to report every
triggered breakpoints. This is likely to happen if we set
sysctl_perf_event_sample_rate to 1.

This patch removes support for unthrottling in breakpoint
events to break throttling and restore ptrace determinism.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: 2.6.33.x <stable@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
2010-02-28 20:51:15 +01:00
Linus Torvalds
6f5621cb16 Merge branch 'x86-ptrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-ptrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, ptrace: Remove set_stopped_child_used_math() in [x]fpregs_set
  x86, ptrace: Simplify xstateregs_get()
  ptrace: Fix ptrace_regset() comments and diagnose errors specifically
  parisc: Disable CONFIG_HAVE_ARCH_TRACEHOOK
  ptrace: Add support for generic PTRACE_GETREGSET/PTRACE_SETREGSET
  x86, ptrace: regset extensions to support xstate
2010-02-28 10:59:44 -08:00
Dmitry Monakhov
9a8c28c831 blktrace: perform cleanup after setup error
Currently even if BLKTRACESETUP ioctl has failed user must call
BLKTRACETEARDOWN to be shure what all staff was cleaned, which
is contr-intuitive.
Let's setup ioctl make necessery cleanup by it self.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-02-28 19:47:19 +01:00
Frederic Weisbecker
ae1f30384b tracing: Include irqflags headers from trace clock
trace_clock.c includes spinlock.h, which ends up including
asm/system.h, which in turn includes linux/irqflags.h in x86.

So the definition of raw_local_irq_save is luckily covered there,
but this is not the case in parisc:

   tip/kernel/trace/trace_clock.c:86: error: implicit declaration of function 'raw_local_irq_save'
   tip/kernel/trace/trace_clock.c:112: error: implicit declaration of function 'raw_local_irq_restore'

We need to include linux/irqflags.h directly from trace_clock.c
to avoid such build error.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-28 19:45:01 +01:00
Linus Torvalds
46bbffad54 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, mm: Unify kernel_physical_mapping_init() API
  x86, mm: Allow highmem user page tables to be disabled at boot time
  x86: Do not reserve brk for DMI if it's not going to be used
  x86: Convert tlbstate_lock to raw_spinlock
  x86: Use the generic page_is_ram()
  x86: Remove BIOS data range from e820
  Move page_is_ram() declaration to mm.h
  Generic page_is_ram: use __weak
  resources: introduce generic page_is_ram()
2010-02-28 10:38:45 -08:00
Linus Torvalds
f66ffdedbf Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
  sched: Fix SCHED_MC regression caused by change in sched cpu_power
  sched: Don't use possibly stale sched_class
  kthread, sched: Remove reference to kthread_create_on_cpu
  sched: cpuacct: Use bigger percpu counter batch values for stats counters
  percpu_counter: Make __percpu_counter_add an inline function on UP
  sched: Remove member rt_se from struct rt_rq
  sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu]
  sched: Remove unused update_shares_locked()
  sched: Use for_each_bit
  sched: Queue a deboosted task to the head of the RT prio queue
  sched: Implement head queueing for sched_rt
  sched: Extend enqueue_task to allow head queueing
  sched: Remove USER_SCHED
  sched: Fix the place where group powers are updated
  sched: Assume *balance is valid
  sched: Remove load_balance_newidle()
  sched: Unify load_balance{,_newidle}()
  sched: Add a lock break for PREEMPT=y
  sched: Remove from fwd decls
  sched: Remove rq_iterator from move_one_task
  ...

Fix up trivial conflicts in kernel/sched.c
2010-02-28 10:31:01 -08:00
Linus Torvalds
2531216f23 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Fix race between ttwu() and task_rq_lock()
  sched: Fix SMT scheduler regression in find_busiest_queue()
  sched: Fix sched_mv_power_savings for !SMT
  kernel/sched.c: Suppress unused var warning
2010-02-28 10:23:41 -08:00
Linus Torvalds
6556a67435 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (172 commits)
  perf_event, amd: Fix spinlock initialization
  perf_event: Fix preempt warning in perf_clock()
  perf tools: Flush maps on COMM events
  perf_events, x86: Split PMU definitions into separate files
  perf annotate: Handle samples not at objdump output addr boundaries
  perf_events, x86: Remove superflous MSR writes
  perf_events: Simplify code by removing cpu argument to hw_perf_group_sched_in()
  perf_events, x86: AMD event scheduling
  perf_events: Add new start/stop PMU callbacks
  perf_events: Report the MMAP pgoff value in bytes
  perf annotate: Defer allocating sym_priv->hist array
  perf symbols: Improve debugging information about symtab origins
  perf top: Use a macro instead of a constant variable
  perf symbols: Check the right return variable
  perf/scripts: Tag syscall_name helper as not yet available
  perf/scripts: Add perf-trace-python Documentation
  perf/scripts: Remove unnecessary PyTuple resizes
  perf/scripts: Add syscall tracing scripts
  perf/scripts: Add Python scripting engine
  perf/scripts: Remove check-perf-trace from listed scripts
  ...

Fix trivial conflict in tools/perf/util/probe-event.c
2010-02-28 10:20:25 -08:00
Linus Torvalds
e0d272429a Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (28 commits)
  ftrace: Add function names to dangling } in function graph tracer
  tracing: Simplify memory recycle of trace_define_field
  tracing: Remove unnecessary variable in print_graph_return
  tracing: Fix typo of info text in trace_kprobe.c
  tracing: Fix typo in prof_sysexit_enable()
  tracing: Remove CONFIG_TRACE_POWER from kernel config
  tracing: Fix ftrace_event_call alignment for use with gcc 4.5
  ftrace: Remove memory barriers from NMI code when not needed
  tracing/kprobes: Add short documentation for HAVE_REGS_AND_STACK_ACCESS_API
  s390: Add pt_regs register and stack access API
  tracing/kprobes: Make Kconfig dependencies generic
  tracing: Unify arch_syscall_addr() implementations
  tracing: Add notrace to TRACE_EVENT implementation functions
  ftrace: Allow to remove a single function from function graph filter
  tracing: Add correct/incorrect to sort keys for branch annotation output
  tracing: Simplify test for function_graph tracing start point
  tracing: Drop the tr check from the graph tracing path
  tracing: Add stack dump to trace_printk if stacktrace option is set
  tracing: Use appropriate perl constructs in recordmcount.pl
  tracing: optimize recordmcount.pl for offsets-handling
  ...
2010-02-28 10:17:55 -08:00
Linus Torvalds
642c4c75a7 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (44 commits)
  rcu: Fix accelerated GPs for last non-dynticked CPU
  rcu: Make non-RCU_PROVE_LOCKING rcu_read_lock_sched_held() understand boot
  rcu: Fix accelerated grace periods for last non-dynticked CPU
  rcu: Export rcu_scheduler_active
  rcu: Make rcu_read_lock_sched_held() take boot time into account
  rcu: Make lockdep_rcu_dereference() message less alarmist
  sched, cgroups: Fix module export
  rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information
  rcu: Fix rcutorture mod_timer argument to delay one jiffy
  rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection
  rcu: Convert to raw_spinlocks
  rcu: Stop overflowing signed integers
  rcu: Use canonical URL for Mathieu's dissertation
  rcu: Accelerate grace period if last non-dynticked CPU
  rcu: Fix citation of Mathieu's dissertation
  rcu: Documentation update for CONFIG_PROVE_RCU
  security: Apply lockdep-based checking to rcu_dereference() uses
  idr: Apply lockdep-based diagnostics to rcu_dereference() uses
  radix-tree: Disable RCU lockdep checking in radix tree
  vfs: Abstract rcu_dereference_check for files-fdtable use
  ...
2010-02-28 10:13:16 -08:00
Linus Torvalds
f91b22c35f Merge branches 'core-ipi-for-linus', 'core-locking-for-linus', 'tracing-fixes-for-linus', 'x86-debug-for-linus', 'x86-doc-for-linus', 'x86-gpu-for-linus' and 'x86-rlimit-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  generic-ipi: Optimize accesses by using DEFINE_PER_CPU_SHARED_ALIGNED for IPI data

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  plist: Fix grammar mistake, and c-style mistake

* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  kprobes: Add mcount to the kprobes blacklist

* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86_64: Print modules like i386 does

* 'x86-doc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Put 'nopat' in kernel-parameters

* 'x86-gpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86-64: Allow fbdev primary video code

* 'x86-rlimit-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Use helpers for rlimits
2010-02-28 10:04:02 -08:00
Paul E. McKenney
622ea685f1 rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
Make the holdoff only happen when the full number of attempts
have been made.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267311188-16603-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-28 09:17:42 +01:00
Tejun Heo
44ee63587d percpu: Add __percpu sparse annotations to hw_breakpoint
Add __percpu sparse annotations to hw_breakpoint.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors.  This patch doesn't affect normal builds.

In kernel/hw_breakpoint.c, per_cpu(nr_task_bp_pinned, cpu)'s will
trigger spurious noderef related warnings from sparse.  Changing it to
&per_cpu(nr_task_bp_pinned[0], cpu) will work around the problem but
deemed to ugly by the maintainer.  Leave it alone until better
solution can be found.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
LKML-Reference: <4B7B4B7A.9050902@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-02-27 16:23:39 +01:00
Frederic Weisbecker
018cbffe68 Merge commit 'v2.6.33' into perf/core
Merge reason:
	__percpu annotations need the corresponding sparse address
space definition upstream.

Conflicts:
	tools/perf/util/probe-event.c (trivial)
2010-02-27 16:18:46 +01:00
Ingo Molnar
480917427b Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/core 2010-02-27 10:41:16 +01:00
Ingo Molnar
6fb83029db Merge branch 'tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into tracing/core 2010-02-27 10:06:10 +01:00
Paul E. McKenney
71da81324c rcu: Fix accelerated GPs for last non-dynticked CPU
This patch disables irqs across the call to rcu_needs_cpu().  It
also enforces a hold-off period so that the idle loop doesn't
softirq itself to death when there are lots of RCU callbacks in
flight on the last non-dynticked CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267231138-27856-3-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-27 09:53:53 +01:00
Paul E. McKenney
a47cd880b5 rcu: Fix accelerated grace periods for last non-dynticked CPU
It is invalid to invoke __rcu_process_callbacks() with irqs
disabled, so do it indirectly via raise_softirq().  This
requires a state-machine implementation to cycle through the
grace-period machinery the required number of times.

Located-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267231138-27856-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-27 09:53:52 +01:00
Linus Torvalds
06a79b82b2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM / Hibernate: Fix preallocating of memory
  PM / Hibernate: Remove swsusp.c finally
  PM / Hibernate: Remove trailing space in message
  PM: Allow SCSI devices to suspend/resume asynchronously
  PM: Allow USB devices to suspend/resume asynchronously
  USB: implement non-tree resume ordering constraints for PCI host controllers
  PM: Allow PCI devices to suspend/resume asynchronously
  PM / Hibernate: Swap, remove useless check from swsusp_read()
  PM / Hibernate: Really deprecate deprecated user ioctls
  PM: Allow device drivers to use dpm_wait()
  PM: Start asynchronous resume threads upfront
  PM: Add facility for advanced testing of async suspend/resume
  PM: Add a switch for disabling/enabling asynchronous suspend/resume
  PM: Asynchronous suspend and resume of devices
  PM: Add parent information to timing messages
  PM: Document device power attributes in sysfs
  PM / Runtime: Add sysfs switch for disabling device run-time PM
2010-02-26 17:22:53 -08:00
Linus Torvalds
37d4008484 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (31 commits)
  crypto: aes_generic - Fix checkpatch errors
  crypto: fcrypt - Fix checkpatch errors
  crypto: ecb - Fix checkpatch errors
  crypto: des_generic - Fix checkpatch errors
  crypto: deflate - Fix checkpatch errors
  crypto: crypto_null - Fix checkpatch errors
  crypto: cipher - Fix checkpatch errors
  crypto: crc32 - Fix checkpatch errors
  crypto: compress - Fix checkpatch errors
  crypto: cast6 - Fix checkpatch errors
  crypto: cast5 - Fix checkpatch errors
  crypto: camellia - Fix checkpatch errors
  crypto: authenc - Fix checkpatch errors
  crypto: api - Fix checkpatch errors
  crypto: anubis - Fix checkpatch errors
  crypto: algapi - Fix checkpatch errors
  crypto: blowfish - Fix checkpatch errors
  crypto: aead - Fix checkpatch errors
  crypto: ablkcipher - Fix checkpatch errors
  crypto: pcrypt - call the complete function on error
  ...
2010-02-26 16:50:02 -08:00
Steven Rostedt
f1c7f517a5 ftrace: Add function names to dangling } in function graph tracer
The function graph tracer is currently the most invasive tracer
in the ftrace family. It can easily overflow the buffer even with
10megs per CPU. This means that events can often be lost.

On start up, or after events are lost, if the function return is
recorded but the function enter was lost, all we get to see is the
exiting '}'.

Here is how a typical trace output starts:

 [tracing] cat trace
 # tracer: function_graph
 #
 # CPU  DURATION                  FUNCTION CALLS
 # |     |   |                     |   |   |   |
  0) + 91.897 us   |                  }
  0) ! 567.961 us  |                }
  0)   <========== |
  0) ! 579.083 us  |                _raw_spin_lock_irqsave();
  0)   4.694 us    |                _raw_spin_unlock_irqrestore();
  0) ! 594.862 us  |              }
  0) ! 603.361 us  |            }
  0) ! 613.574 us  |          }
  0) ! 623.554 us  |        }
  0)   3.653 us    |        fget_light();
  0)               |        sock_poll() {

There are a series of '}' with no matching "func() {". There's no information
to what functions these ending brackets belong to.

This patch adds a stack on the per cpu structure used in outputting
the function graph tracer to keep track of what function was outputted.
Then on a function exit event, it checks the depth to see if the
function exit has a matching entry event. If it does, then it only
prints the '}', otherwise it adds the function name after the '}'.

This allows function exit events to show what function they belong to
at trace output startup, when the entry was lost due to ring buffer
overflow, or even after a new task is scheduled in.

Here is what the above trace will look like after this patch:

 [tracing] cat trace
 # tracer: function_graph
 #
 # CPU  DURATION                  FUNCTION CALLS
 # |     |   |                     |   |   |   |
  0) + 91.897 us   |                  } (irq_exit)
  0) ! 567.961 us  |                } (smp_apic_timer_interrupt)
  0)   <========== |
  0) ! 579.083 us  |                _raw_spin_lock_irqsave();
  0)   4.694 us    |                _raw_spin_unlock_irqrestore();
  0) ! 594.862 us  |              } (add_wait_queue)
  0) ! 603.361 us  |            } (__pollwait)
  0) ! 613.574 us  |          } (tcp_poll)
  0) ! 623.554 us  |        } (sock_poll)
  0)   3.653 us    |        fget_light();
  0)               |        sock_poll() {

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-02-26 19:25:53 -05:00
Rafael J. Wysocki
a9c9b4429d PM / Hibernate: Fix preallocating of memory
The hibernate memory preallocation code allocates memory to push some
user space data out of physical RAM, so that the hibernation image is
not too large.  It allocates more memory than necessary for creating
the image, so it has to release some pages to make room for
allocations made while suspending devices and disabling nonboot CPUs,
or the system will hang due to the lack of free pages to allocate
from.  Unfortunately, the function used for freeing these pages,
free_unnecessary_pages(), contains a bug that prevents it from doing
the job on all systems without highmem.

Fix this problem, which is a regression from the 2.6.30 kernel, by
using the right condition for the termination of the loop in
free_unnecessary_pages().

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-and-tested-by: Alan Jenkins <sourcejedi.lkml@googlemail.com>
Cc: stable@kernel.org
2010-02-26 20:39:13 +01:00
Jiri Slaby
f8bb0db818 PM / Hibernate: Remove swsusp.c finally
Its contents and entry in Makefile were already removed in
8e60c6a134
(Shift remaining code from swsusp.c to hibernate.c)
but somehow it remained in-place (rjw: which most likely was my
mistake).

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-02-26 20:39:13 +01:00
Frans Pop
07c3bb5797 PM / Hibernate: Remove trailing space in message
Remove a trailing space from a message in swsusp_save().

Signed-off-by: Frans Pop <elendil@planet.nl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-02-26 20:39:13 +01:00
Jiri Slaby
09c09bc618 PM / Hibernate: Swap, remove useless check from swsusp_read()
It will never reach here if the sws_resume_bdev is erratic.
swsusp_read() is called only from software_resume(), but after
swsusp_check() which would catch the error state.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-02-26 20:39:11 +01:00
Jiri Slaby
b694e52ebd PM / Hibernate: Really deprecate deprecated user ioctls
They were deprecated and removed from exported headers more than 2
years ago. Inform users about their removal in the future now.

(Switch cases needed to be reorderded for an easy fall through.)

And add an entry to feature-removal-schedule.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-02-26 20:39:11 +01:00
Rafael J. Wysocki
5a2eb8585f PM: Add facility for advanced testing of async suspend/resume
Add configuration switch CONFIG_PM_ADVANCED_DEBUG for compiling in
extra PM debugging/testing code allowing one to access some
PM-related attributes of devices from the user space via sysfs.

If CONFIG_PM_ADVANCED_DEBUG is set, add sysfs attribute power/async
for every device allowing the user space to access the device's
power.async_suspend flag and modify it, if desired.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-02-26 20:39:10 +01:00
Rafael J. Wysocki
0e06b4a891 PM: Add a switch for disabling/enabling asynchronous suspend/resume
Add sysfs attribute /sys/power/pm_async allowing the user space to
disable/enable asynchronous suspend/resume of devices.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-02-26 20:39:10 +01:00
Linus Torvalds
68c6b85984 Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (48 commits)
  x86/PCI: Prevent mmconfig memory corruption
  ACPI: Use GPE reference counting to support shared GPEs
  x86/PCI: use host bridge _CRS info by default on 2008 and newer machines
  PCI: augment bus resource table with a list
  PCI: add pci_bus_for_each_resource(), remove direct bus->resource[] refs
  PCI: read bridge windows before filling in subtractive decode resources
  PCI: split up pci_read_bridge_bases()
  PCIe PME: use pci_pcie_cap()
  PCI PM: Run-time callbacks for PCI bus type
  PCIe PME: use pci_is_pcie()
  PCI / ACPI / PM: Platform support for PCI PME wake-up
  ACPI / ACPICA: Multiple system notify handlers per device
  ACPI / PM: Add more run-time wake-up fields
  ACPI: Use GPE reference counting to support shared GPEs
  PCI PM: Make it possible to force using INTx for PCIe PME signaling
  PCI PM: PCIe PME root port service driver
  PCI PM: Add function for checking PME status of devices
  PCI: mark is_pcie obsolete
  PCI: set PCI_PREF_RANGE_TYPE_64 in pci_bridge_check_ranges
  PCI: pciehp: second try to get big range for pcie devices
  ...
2010-02-26 10:35:27 -08:00
Peter Zijlstra
24691ea964 perf_event: Fix preempt warning in perf_clock()
A recent commit introduced a preemption warning for
perf_clock(), use raw_smp_processor_id() to avoid this, it
really doesn't matter which cpu we use here.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1267198583.22519.684.camel@laptop>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 17:25:00 +01:00
Suresh Siddha
dd5feea14a sched: Fix SCHED_MC regression caused by change in sched cpu_power
On platforms like dual socket quad-core platform, the scheduler load
balancer is not detecting the load imbalances in certain scenarios. This
is leading to scenarios like where one socket is completely busy (with
all the 4 cores running with 4 tasks) and leaving another socket
completely idle. This causes performance issues as those 4 tasks share
the memory controller, last-level cache bandwidth etc. Also we won't be
taking advantage of turbo-mode as much as we would like, etc.

Some of the comparisons in the scheduler load balancing code are
comparing the "weighted cpu load that is scaled wrt sched_group's
cpu_power" with the "weighted average load per task that is not scaled
wrt sched_group's cpu_power". While this has probably been broken for a
longer time (for multi socket numa nodes etc), the problem got aggrevated
via this recent change:

 |
 |  commit f93e65c186
 |  Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
 |  Date:   Tue Sep 1 10:34:32 2009 +0200
 |
 |	sched: Restore __cpu_power to a straight sum of power
 |

Also with this change, the sched group cpu power alone no longer reflects
the group capacity that is needed to implement MC, MT performance
(default) and power-savings (user-selectable) policies.

We need to use the computed group capacity (sgs.group_capacity, that is
computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to
find out if the group with the max load is above its capacity and how
much load to move etc.

Reported-by: Ma Ling <ling.ma@intel.com>
Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
[ -v2: build fix ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org> # [2.6.32.x, 2.6.33.x]
LKML-Reference: <1266970432.11588.22.camel@sbs-t61.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 15:45:13 +01:00
Peter Zijlstra
6e37738a2f perf_events: Simplify code by removing cpu argument to hw_perf_group_sched_in()
Since the cpu argument to hw_perf_group_sched_in() is always
smp_processor_id(), simplify the code a little by removing this argument
and using the current cpu where needed.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Miller <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1265890918.5396.3.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 10:56:53 +01:00
Stephane Eranian
38331f62c2 perf_events, x86: AMD event scheduling
This patch adds correct AMD NorthBridge event scheduling.

NB events are events measuring L3 cache, Hypertransport traffic. They are
identified by an event code >= 0xe0. They measure events on the
Northbride which is shared by all cores on a package. NB events are
counted on a shared set of counters. When a NB event is programmed in a
counter, the data actually comes from a shared counter. Thus, access to
those counters needs to be synchronized.

We implement the synchronization such that no two cores can be measuring
NB events using the same counters. Thus, we maintain a per-NB allocation
table. The available slot is propagated using the event_constraint
structure.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4b703957.0702d00a.6bf2.7b7d@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 10:56:53 +01:00
Stephane Eranian
d76a0812ac perf_events: Add new start/stop PMU callbacks
In certain situations, the kernel may need to stop and start the same
event rapidly. The current PMU callbacks do not distinguish between stop
and release (i.e., stop + free the resource). Thus, a counter may be
released, then it will be immediately re-acquired. Event scheduling will
again take place with no guarantee to assign the same counter. On some
processors, this may event yield to failure to assign the event back due
to competion between cores.

This patch is adding a new pair of callback to stop and restart a counter
without actually release the underlying counter resource. On stop, the
counter is stopped, its values saved and that's it. On start, the value
is reloaded and counter is restarted (on x86, actual restart is delayed
until perf_enable()).

Signed-off-by: Stephane Eranian <eranian@google.com>
[ added fallback to ->enable/->disable for all other PMUs
  fixed x86_pmu_start() to call x86_pmu.enable()
  merged __x86_pmu_disable into x86_pmu_stop() ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4b703875.0a04d00a.7896.ffffb824@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 10:56:53 +01:00
Peter Zijlstra
3a0304e90a perf_events: Report the MMAP pgoff value in bytes
DaveM reported that currently perf interprets the pgoff value reported by
the MMAP events as a byte range, but the kernel reports it as a page
offset.

Since its broken (and unusable) anyway, change the kernel behaviour (ABI)
to report bytes indeed, avoiding the need for userspace to deal with
PAGE_SIZE things.

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 10:56:52 +01:00
Ingo Molnar
281b3714e9 Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/core 2010-02-26 09:20:17 +01:00
Ingo Molnar
64b9fb5704 Merge commit 'v2.6.33' into tracing/core
Conflicts:
	scripts/recordmcount.pl

Merge reason: Merge up to v2.6.33.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 09:18:32 +01:00
Yinghai Lu
fb90ef93df early_res: Add free_early_partial()
To free partial areas in pcpu_setup...

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <4B85E245.5030001@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 08:25:35 +01:00
Paul E. McKenney
f5f6540964 rcu: Export rcu_scheduler_active
Kernel modules using rcu_read_lock_sched_held() must now have
access to rcu_scheduler_active, so it must be exported.

This should fix the fix for the boot-time RCU-lockdep splat.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <20100226030230.GA7743@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 08:20:46 +01:00
Paul E. McKenney
d9f1bb6ad7 rcu: Make rcu_read_lock_sched_held() take boot time into account
Before the scheduler starts, all tasks are non-preemptible by
definition. So, during that time, rcu_read_lock_sched_held()
needs to always return "true".  This patch makes that be so.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267135607-7056-2-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 08:20:46 +01:00
Paul E. McKenney
056ba4a9be rcu: Make lockdep_rcu_dereference() message less alarmist
Change from "unsafe" to "suspicious", given that there will be
false alarms.

Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267135607-7056-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-26 08:20:46 +01:00
Masami Hiramatsu
b2be84df99 kprobes: Jump optimization sysctl interface
Add /proc/sys/debug/kprobes-optimization sysctl which enables
and disables kprobes jump optimization on the fly for debugging.

Changes in v7:
 - Remove ctl_name = CTL_UNNUMBERED for upstream compatibility.

Changes in v6:
- Update comments and coding style.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
LKML-Reference: <20100225133415.6725.8274.stgit@localhost6.localdomain6>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 17:49:25 +01:00
Masami Hiramatsu
afd66255b9 kprobes: Introduce kprobes jump optimization
Introduce kprobes jump optimization arch-independent parts.
Kprobes uses breakpoint instruction for interrupting execution
flow, on some architectures, it can be replaced by a jump
instruction and interruption emulation code. This gains kprobs'
performance drastically.

To enable this feature, set CONFIG_OPTPROBES=y (default y if the
arch supports OPTPROBE).

Changes in v9:
 - Fix a bug to optimize probe when enabling.
 - Check nearby probes can be optimize/unoptimize when disarming/arming
   kprobes, instead of registering/unregistering. This will help
   kprobe-tracer because most of probes on it are usually disabled.

Changes in v6:
 - Cleanup coding style for readability.
 - Add comments around get/put_online_cpus().

Changes in v5:
 - Use get_online_cpus()/put_online_cpus() for avoiding text_mutex
   deadlock.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
LKML-Reference: <20100225133407.6725.81992.stgit@localhost6.localdomain6>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 17:49:24 +01:00
Masami Hiramatsu
4610ee1d36 kprobes: Introduce generic insn_slot framework
Make insn_slot framework support various size slots.
Current insn_slot just supports one-size instruction buffer
slot. However, kprobes jump optimization needs larger size
buffers.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
LKML-Reference: <20100225133358.6725.82430.stgit@localhost6.localdomain6>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Anders Kaseorg <andersk@ksplice.com>
Cc: Tim Abbott <tabbott@ksplice.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
2010-02-25 17:49:24 +01:00
Wenji Huang
7b60997f73 tracing: Simplify memory recycle of trace_define_field
Discard freeing field->type since it is not necessary.

Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
LKML-Reference: <1266997226-6833-5-git-send-email-wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-02-25 10:42:55 -05:00
Wenji Huang
c85f3a91f8 tracing: Remove unnecessary variable in print_graph_return
The "cpu" variable is declared at the start of the function and
also within a branch, with the exact same initialization.

Remove the local variable of the same name in the branch.

Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
LKML-Reference: <1266997226-6833-3-git-send-email-wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-02-25 10:41:24 -05:00
Wenji Huang
a5efd92511 tracing: Fix typo of info text in trace_kprobe.c
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
LKML-Reference: <1266997226-6833-2-git-send-email-wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-02-25 10:36:29 -05:00
Wenji Huang
6574658b3b tracing: Fix typo in prof_sysexit_enable()
Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
LKML-Reference: <1266997226-6833-1-git-send-email-wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-02-25 10:35:55 -05:00
Li Zefan
1ab83a8941 tracing: Remove CONFIG_TRACE_POWER from kernel config
The power tracer has been converted to power trace events.

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B84D50E.4070806@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-02-25 10:31:45 -05:00
Jeff Mahoney
86c38a31aa tracing: Fix ftrace_event_call alignment for use with gcc 4.5
GCC 4.5 introduces behavior that forces the alignment of structures to
 use the largest possible value. The default value is 32 bytes, so if
 some structures are defined with a 4-byte alignment and others aren't
 declared with an alignment constraint at all - it will align at 32-bytes.

 For things like the ftrace events, this results in a non-standard array.
 When initializing the ftrace subsystem, we traverse the _ftrace_events
 section and call the initialization callback for each event. When the
 structures are misaligned, we could be treating another part of the
 structure (or the zeroed out space between them) as a function pointer.

 This patch forces the alignment for all the ftrace_event_call structures
 to 4 bytes.

 Without this patch, the kernel fails to boot very early when built with
 gcc 4.5.

 It's trivial to check the alignment of the members of the array, so it
 might be worthwhile to add something to the build system to do that
 automatically. Unfortunately, that only covers this case. I've asked one
 of the gcc developers about adding a warning when this condition is seen.

Cc: stable@kernel.org
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
LKML-Reference: <4B85770B.6010901@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-02-25 09:38:11 -05:00
Don Zickus
47195d5763 nmi_watchdog: Clean up various small details
Mostly copy/paste whitespace damage with a couple of nitpicks by
the checkpatch script. Fix the struct definition as requested by Ingo too.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: peterz@infradead.org
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
LKML-Reference: <1266880143-24943-1-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
 arch/x86/kernel/apic/hw_nmi.c |   14 +++++------
 arch/x86/kernel/traps.c       |    6 ++--
 include/linux/nmi.h           |    2 -
 kernel/nmi_watchdog.c         |   51 ++++++++++++++++++++----------------------
 4 files changed, 36 insertions(+), 37 deletions(-)
2010-02-25 12:40:50 +01:00
Ingo Molnar
c50cc75271 sched, cgroups: Fix module export
I have exported it in d11c563 - but cgroups.c did not have module.h included ...

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 12:02:13 +01:00
Paul E. McKenney
1ed509a225 rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information
When RCU detects a grace-period stall, it currently just prints
out the PID of any tasks doing the stalling.  This patch adds
RCU_CPU_STALL_VERBOSE, which enables the more-verbose reporting
from sched_show_task().

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-21-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:35:02 +01:00
Paul E. McKenney
6155fec92e rcu: Fix rcutorture mod_timer argument to delay one jiffy
The current "mod_timer(&t, 1)" potentially makes the timer fire
immediately, change this to wait one jiffy.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-20-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:35:00 +01:00
Paul E. McKenney
3acd9eb31c rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection
Under TREE_PREEMPT_RCU, print_other_cpu_stall() invokes
rcu_print_task_stall() with the root rcu_node structure's ->lock
held, and rcu_print_task_stall() acquires that same lock for
self-deadlock. Fix this by removing the lock acquisition from
rcu_print_task_stall(), and making all callers acquire the lock
instead.

Tested-by: John Kacur <jkacur@redhat.com>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Located-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-19-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:59 +01:00
Paul E. McKenney
1304afb225 rcu: Convert to raw_spinlocks
The spinlocks in rcutree need to be real spinlocks in
preempt-rt. Convert them to raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-18-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:58 +01:00
Paul E. McKenney
20133cfce7 rcu: Stop overflowing signed integers
The C standard does not specify the result of an operation that
overflows a signed integer, so such operations need to be
avoided.  This patch changes the type of several fields from
"long" to "unsigned long" and adjusts operations as needed.
ULONG_CMP_GE() and ULONG_CMP_LT() macros are introduced to do
the modular comparisons that are appropriate given that overflow
is an expected event.

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-17-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:57 +01:00
Paul E. McKenney
8bd93a2c5d rcu: Accelerate grace period if last non-dynticked CPU
Currently, rcu_needs_cpu() simply checks whether the current CPU
has an outstanding RCU callback, which means that the last CPU
to go into dyntick-idle mode might wait a few ticks for the
relevant grace periods to complete.  However, if all the other
CPUs are in dyntick-idle mode, and if this CPU is in a quiescent
state (which it is for RCU-bh and RCU-sched any time that we are
considering going into dyntick-idle mode), then the grace period
is instantly complete.

This patch therefore repeatedly invokes the RCU grace-period
machinery in order to force any needed grace periods to complete
quickly.  It does so a limited number of times in order to
prevent starvation by an RCU callback function that might pass
itself to call_rcu().

However, if any CPU other than the current one is not in
dyntick-idle mode, fall back to simply checking (with fix to bug
noted by Lai Jiangshan).  Also, take advantage of last
grace-period forcing, the opportunity to do so noted by Steve
Rostedt.  And apply simplified #ifdef condition suggested by
Frederic Weisbecker.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-15-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:55 +01:00
Paul E. McKenney
497f0ab39c sched: Better name for for_each_domain_rd
As suggested by Peter Ziljstra, make better choice of name
for for_each_domain_rd(), containing "rcu_dereference", given
that it is but a wrapper for rcu_dereference_check().  The name
rcu_dereference_check_sched_domain() does that and provides a
separate per-subsystem name space.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-7-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:47 +01:00
Paul E. McKenney
d11c563dd2 sched: Use lockdep-based checking on rcu_dereference()
Update the rcu_dereference() usages to take advantage of the new
lockdep-based checking.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com>
[ -v2: fix allmodconfig missing symbol export build failure on x86 ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:26 +01:00
Paul E. McKenney
0632eb3d75 rcu: Integrate rcu_dereference_check() message into lockdep
Make rcu_dereference_check() print the list of held locks in
addition to the stack dump to ease debugging.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-3-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 09:41:01 +01:00
Paul E. McKenney
632ee20013 rcu: Introduce lockdep-based checking to RCU read-side primitives
Inspection is proving insufficient to catch all RCU misuses,
which is understandable given that rcu_dereference() might be
protected by any of four different flavors of RCU (RCU, RCU-bh,
RCU-sched, and SRCU), and might also/instead be protected by any
of a number of locking primitives. It is therefore time to
enlist the aid of lockdep.

This set of patches is inspired by earlier work by Peter
Zijlstra and Thomas Gleixner, and takes the following approach:

o	Set up separate lockdep classes for RCU, RCU-bh, and RCU-sched.

o	Set up separate lockdep classes for each instance of SRCU.

o	Create primitives that check for being in an RCU read-side
	critical section.  These return exact answers if lockdep is
	fully enabled, but if unsure, report being in an RCU read-side
	critical section.  (We want to avoid false positives!)
	The primitives are:

	For RCU: rcu_read_lock_held(void)

	For RCU-bh: rcu_read_lock_bh_held(void)

	For RCU-sched: rcu_read_lock_sched_held(void)

	For SRCU: srcu_read_lock_held(struct srcu_struct *sp)

o	Add rcu_dereference_check(), which takes a second argument
	in which one places a boolean expression based on the above
	primitives and/or lockdep_is_held().

o	A new kernel configuration parameter, CONFIG_PROVE_RCU, enables
	rcu_dereference_check().  This depends on CONFIG_PROVE_LOCKING,
	and should be quite helpful during the transition period while
	CONFIG_PROVE_RCU-unaware patches are in flight.

The existing rcu_dereference() primitive does no checking, but
upcoming patches will change that.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 09:40:59 +01:00
Ingo Molnar
996de8c6fe Merge commit 'v2.6.33' into core/rcu
Merge reason: Update from -rc4 to -final.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 09:40:26 +01:00
Suresh Siddha
c6a0dd7ec6 ptrace: Fix ptrace_regset() comments and diagnose errors specifically
Return -EINVAL for the bad size and for unrecognized NT_* type in
ptrace_regset() instead of -EIO.

Also update the comments for this ptrace interface with more clarifications.

Requested-by: Roland McGrath <roland@redhat.com>
Requested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20100222225240.397523600@sbs-t61.sc.intel.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-23 13:45:26 -08:00
Tetsuo Handa
701188374b kernel/sys.c: fix missing rcu protection for sys_getpriority()
find_task_by_vpid() is not safe without rcu_read_lock().  2.6.33-rc7 got
RCU protection for sys_setpriority() but missed it for sys_getpriority().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-22 19:50:34 -08:00
Rafael J. Wysocki
6cbf82148f PCI PM: Run-time callbacks for PCI bus type
Introduce run-time PM callbacks for the PCI bus type.  Make the new
callbacks work in analogy with the existing system sleep PM
callbacks, so that the drivers already converted to struct dev_pm_ops
can use their suspend and resume routines for run-time PM without
modifications.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2010-02-22 16:21:19 -08:00
Yinghai Lu
5eeec0ec93 resource: add release_child_resources
Useful for freeing a portion of the resource tree, e.g. when trying to
reallocate resources more efficiently.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2010-02-22 16:17:00 -08:00
Dominik Brodowski
3b7a17fcda resource/PCI: mark struct resource as const
Now that we return the new resource start position, there is no
need to update "struct resource" inside the align function.
Therefore, mark the struct resource as const.

Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2010-02-22 16:16:57 -08:00
Dominik Brodowski
b26b2d494b resource/PCI: align functions now return start of resource
As suggested by Linus, align functions should return the start
of a resource, not void. An update of "res->start" is no longer
necessary.

Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2010-02-22 16:16:56 -08:00
Linus Torvalds
bee415ce42 Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf probe: Init struct probe_point and set counter correctly
  hw-breakpoint: Keep track of dr7 local enable bits
  hw-breakpoints: Accept breakpoints on NULL address
  perf_events: Fix FORK events
2010-02-22 08:55:32 -08:00
Alexey Dobriyan
b54452b07a const: struct nla_policy
Make remaining netlink policies as const.
Fixup coding style where needed.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-18 14:30:18 -08:00
Yinghai Lu
b5eb78f76d sparseirq: Use radix_tree instead of ptrs array
Use radix_tree irq_desc_tree instead of irq_desc_ptrs.

-v2: according to Eric and cyrill to use radix_tree_lookup_slot and
     radix_tree_replace_slot

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-32-git-send-email-yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-17 17:27:20 -08:00
Yinghai Lu
99558f0bbe sparseirq: Change irq_desc_ptrs to static
Add replace_irq_desc() instead of poking at the array directly.

-v2: remove unneeded boundary check in replace_irq_desc

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-31-git-send-email-yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-17 17:27:03 -08:00
Yinghai Lu
febcb0c59a irq: Remove unnecessary bootmem code
mem_init is moved early already.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-29-git-send-email-yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-17 17:23:59 -08:00
Thomas Gleixner
b7e56edba4 Merge branch 'linus' into x86/mm
x86/mm is on 32-rc4 and missing the spinlock namespace changes which
are needed for further commits into this topic.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-17 18:28:05 +01:00
Heiko Carstens
f850c30c8b tracing/kprobes: Make Kconfig dependencies generic
KPROBES_EVENT actually depends on the regs and stack access API
(b1cf540f) and not on x86.
So introduce a new config option which architectures can select if
they have the API implemented and switch x86.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
LKML-Reference: <20100210162517.GB6933@osiris.boeblingen.de.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-02-17 13:13:08 +01:00
Mike Frysinger
e7b8e675d9 tracing: Unify arch_syscall_addr() implementations
Most implementations of arch_syscall_addr() are the same, so create a
default version in common code and move the one piece that differs (the
syscall table) to asm/syscall.h.  New arch ports don't have to waste
time copying & pasting this simple function.

The s390/sparc versions need to be different, so document why.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1264498803-17278-1-git-send-email-vapier@gentoo.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-02-17 13:07:21 +01:00
Thomas Gleixner
83ab0aa0d5 sched: Don't use possibly stale sched_class
setscheduler() saves task->sched_class outside of the rq->lock held
region for a check after the setscheduler changes have become
effective. That might result in checking a stale value.

rtmutex_setprio() has the same problem, though it is protected by
p->pi_lock against setscheduler(), but for correctness sake (and to
avoid bad examples) it needs to be fixed as well.

Retrieve task->sched_class inside of the rq->lock held region.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@kernel.org
2010-02-17 11:58:18 +01:00
Don Zickus
96ca4028ac nmi_watchdog: Properly configure for software events
Paul Mackerras brought up a good point that when fallbacking to
software events, I may have been lucky in my configuration.

Modified the code to explicit provide a new configuration for
software events.

Suggested-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1266357745-26671-1-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-17 08:16:02 +01:00
Don Zickus
6081b6cd97 nmi_watchdog: support for oprofile
Re-arrange the code so that when someone disables nmi_watchdog
with:

  echo 0 > /proc/sys/kernel/nmi_watchdog

it releases the hardware reservation on the PMUs.  This allows
the oprofile module to grab those PMUs and do its thing.
Otherwise oprofile fails to load because the hardware is
reserved by the perf_events subsystem.

Tested using:

  oprofile --vm-linux --start

and watched it failed when nmi_watchdog is enabled and succeed
when:

  oprofile --deinit && echo 0 > /proc/sys/kernel/nmi_watchdog

is run.

Note:  this has the side quirk of having the nmi_watchdog latch
onto the software events instead of hardware events if oprofile
has already reserved the hardware first.  User beware! :-)

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
Cc: eranian@google.com
LKML-Reference: <1266357892-30504-1-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-17 08:16:02 +01:00
Yinghai Lu
580e0ad21d core: Move early_res from arch/x86 to kernel/
This makes the range reservation feature available to other
architectures.

-v2: add get_max_mapped, max_pfn_mapped only defined in x86...
     to fix PPC compiling
-v3: according to hpa, add CONFIG_HAVE_EARLY_RES
-v4: fix typo about EARLY_RES in config

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4B7B5723.4070009@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-16 21:43:39 -08:00
Tejun Heo
43cf38eb5c percpu: add __percpu sparse annotations to core kernel subsystems
Add __percpu sparse annotations to core subsystems.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors.  This patch doesn't affect normal builds.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-mm@kvack.org
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Biederman <ebiederm@xmission.com>
2010-02-17 11:17:38 +09:00
Anton Vorontsov
5a5e0f4c70 kfifo: Don't use integer as NULL pointer
This patch fixes following sparse warnings:

include/linux/kfifo.h:127:25: warning: Using plain integer as NULL pointer
kernel/kfifo.c:83:21: warning: Using plain integer as NULL pointer

Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Acked-by: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-02-16 15:11:08 -08:00
Anton Vorontsov
1a02d59aba kfifo: Make kfifo_initialized work after kfifo_free
After kfifo rework it's no longer possible to reliably know if kfifo is
usable, since after kfifo_free(), kfifo_initialized() would still return
true. The correct behaviour is needed for at least FHCI USB driver.

This patch fixes the issue by resetting the kfifo to zero values (the
same approach is used in kfifo_alloc() if allocation failed).

Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Acked-by: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-02-16 15:11:06 -08:00
Thomas Gleixner
6e40f5bbbc Merge branch 'sched/urgent' into sched/core
Conflicts: kernel/sched.c

Necessary due to the urgent fixes which conflict with the code move
from sched.c to sched_fair.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-16 16:48:56 +01:00
Peter Zijlstra
0970d2992d sched: Fix race between ttwu() and task_rq_lock()
Thomas found that due to ttwu() changing a task's cpu without holding
the rq->lock, task_rq_lock() might end up locking the wrong rq.

Avoid this by serializing against TASK_WAKING.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1266241712.15770.420.camel@laptop>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-16 15:13:59 +01:00
Suresh Siddha
9000f05c6d sched: Fix SMT scheduler regression in find_busiest_queue()
Fix a SMT scheduler performance regression that is leading to a scenario
where SMT threads in one core are completely idle while both the SMT threads
in another core (on the same socket) are busy.

This is caused by this commit (with the problematic code highlighted)

   commit bdb94aa5db
   Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
   Date:   Tue Sep 1 10:34:38 2009 +0200

   sched: Try to deal with low capacity

   @@ -4203,15 +4223,18 @@ find_busiest_queue()
   ...
	for_each_cpu(i, sched_group_cpus(group)) {
   +	unsigned long power = power_of(i);

   ...

   -	wl = weighted_cpuload(i);
   +	wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
   +	wl /= power;

   -	if (rq->nr_running == 1 && wl > imbalance)
   +	if (capacity && rq->nr_running == 1 && wl > imbalance)
		continue;

On a SMT system, power of the HT logical cpu will be 589 and
the scheduler load imbalance (for scenarios like the one mentioned above)
can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
the weighted load with the power will result in "wl > imbalance" and
ultimately resulting in find_busiest_queue() return NULL, causing
load_balance() to think that the load is well balanced. But infact
one of the tasks can be moved to the idle core for optimal performance.

We don't need to use the weighted load (wl) scaled by the cpu power to
compare with  imabalance. In that condition, we already know there is only a
single task "rq->nr_running == 1" and the comparison between imbalance,
wl is to make sure that we select the correct priority thread which matches
imbalance. So we really need to compare the imabalnce with the original
weighted load of the cpu and not the scaled load.

But in other conditions where we want the most hammered(busiest) cpu, we can
use scaled load to ensure that we consider the cpu power in addition to the
actual load on that cpu, so that we can move the load away from the
guy that is getting most hammered with respect to the actual capacity,
as compared with the rest of the cpu's in that busiest group.

Fix it.

Reported-by: Ma Ling <ling.ma@intel.com>
Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1266023662.2808.118.camel@sbs-t61.sc.intel.com>
Cc: stable@kernel.org [2.6.32.x]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-16 15:13:59 +01:00
Linus Torvalds
7d0bab9dfe Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  hrtimer, softirq: Fix hrtimer->softirq trampoline
2010-02-15 19:52:12 -08:00
Linus Torvalds
627a9a194d Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing/kprobes: Fix probe parsing
  tracing: Fix circular dead lock in stack trace
2010-02-15 19:47:59 -08:00
Linus Torvalds
3d8b4bdef7 Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf top: Fix help text alignment
  perf: Fix hypervisor sample reporting
  perf: Make bp_len type to u64 generic across the arch
2010-02-15 19:47:48 -08:00
Uwe Kleine-König
dfff0615d2 tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-15 15:38:10 +01:00
Masami Hiramatsu
8b833c506c kprobes: Add mcount to the kprobes blacklist
Since mcount function can be called from everywhere,
it should be blacklisted. Moreover, the "mcount" symbol
is a special symbol name. So, it is better to put it in
the generic blacklist.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100205062433.3745.36726.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-15 05:45:49 +01:00
Peter Zijlstra
6f93d0a7c8 perf_events: Fix FORK events
Commit 22e19085 ("Honour event state for aux stream data")
introduced a bug where we would drop FORK events.

The thing is that we deliver FORK events to the child process'
event, which at that time will be PERF_EVENT_STATE_INACTIVE
because the child won't be scheduled in (we're in the middle of
fork).

Solve this twice, change the event state filter to exclude only
disabled (STATE_OFF) or worse, and deliver FORK events to the
current (parent).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Anton Blanchard <anton@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
LKML-Reference: <1266142324.5273.411.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-14 18:10:39 +01:00
Heiko Carstens
a9bb18f36c tracing/kprobes: Fix probe parsing
Trying to add a probe like:

  echo p:myprobe 0x10000 > /sys/kernel/debug/tracing/kprobe_events

will fail since the wrong pointer is passed to strict_strtoul
when trying to convert the address to an unsigned long.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100210162346.GA6933@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-14 09:43:58 +01:00
Don Zickus
cf454aecb3 nmi_watchdog: Fallback to software events when no hardware pmu detected
Not all arches have a PMU or have perf_event support for their
PMU.  The nmi_watchdog will fail in those cases.  Fallback to
using software events to generate nmi_watchdog traffic with
local apic interrupts.

Tested on a Pentium4 and it worked as expected, excepting for
detecting cpu lockups.

The problem with using software events as a cpu lock up detector
is the nmi_watchdog uses the logic that if local apic interrupts
stop incrementing then the cpu is probably locked up.  But with
software events we use the local apic to trigger the
nmi_watchdog callback to see if local apic interrupts are still
firing, which obviously they are otherwise we wouldn't have been
triggered.

The algorithm to detect cpu lock ups is the same as the old
nmi_watchdog. Perhaps we need to find a better way to detect
lock ups?

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: peterz@infradead.org
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
LKML-Reference: <1266013161-31197-3-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-14 09:19:44 +01:00
Don Zickus
504d7cf10e nmi_watchdog: Compile and portability fixes
The original patch was x86_64 centric.  Changed the code to make
it less so.

ested by building and running on a powerpc.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: peterz@infradead.org
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
LKML-Reference: <1266013161-31197-2-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-14 09:19:43 +01:00
Suresh Siddha
2225a122ae ptrace: Add support for generic PTRACE_GETREGSET/PTRACE_SETREGSET
Generic support for PTRACE_GETREGSET/PTRACE_SETREGSET commands which
export the regsets supported by each architecture using the correponding
NT_* types. These NT_* types are already part of the userland ABI, used
in representing the architecture specific register sets as different NOTES
in an ELF core file.

'addr' parameter for the ptrace system call encode the REGSET type (using
the corresppnding NT_* type) and the 'data' parameter points to the
struct iovec having the user buffer and the length of that buffer.

	struct iovec iov = { buf, len};
	ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, &iov);

On successful completion, iov.len will be updated by the kernel specifying
how much the kernel has written/read to/from the user's iov.buf.

x86 extended state registers are primarily exported using this interface.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20100211195614.886724710@sbs-t61.sc.intel.com>
Acked-by: Hongjiu Lu <hjl.tools@gmail.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-11 15:08:33 -08:00
Li Zefan
c7c6b1fe9f ftrace: Allow to remove a single function from function graph filter
I don't see why we can only clear all functions from the filter.

After patching:

  # echo sys_open > set_graph_function
  # echo sys_close >> set_graph_function
  # cat set_graph_function
  sys_open
  sys_close
  # echo '!sys_close' >> set_graph_function
  # cat set_graph_function
  sys_open

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B726388.2000408@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-02-11 14:32:38 -05:00
Jean Delvare
5c42dc7070 devres/irq: Fix devm_irq_match comment
Fix the reference (in comment).

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-11 16:01:02 +01:00
Yinghai Lu
e9a0064ad0 x86: Change range end to start+size
So make interface more consistent with early_res.
Later we can share some code with early_res.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-10-git-send-email-yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-10 17:47:17 -08:00
Yinghai Lu
27811d8cab x86: Move range related operation to one file
We have almost the same code for mtrr cleanup and amd_bus checkup, and
this code  will also be used in replacing bootmem with early_res,
so try to move them together and reuse it from different parts.

Also rename update_range to subtract_range as that is what the
function is actually doing.

-v2: update comments as Christoph requested

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-4-git-send-email-yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-10 17:47:17 -08:00
H. Peter Anvin
84abd88a70 Merge remote branch 'linus/master' into x86/bootmem 2010-02-10 16:55:28 -08:00
Brandon Phiilps
ced5b697a7 x86: Avoid race condition in pci_enable_msix()
Keep chip_data in create_irq_nr and destroy_irq.

When two drivers are setting up MSI-X at the same time via
pci_enable_msix() there is a race.  See this dmesg excerpt:

[   85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
[   85.170611]   alloc irq_desc for 99 on node -1
[   85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
[   85.170614]   alloc kstat_irqs on node -1
[   85.170616] alloc irq_2_iommu on node -1
[   85.170617]   alloc irq_desc for 100 on node -1
[   85.170619]   alloc kstat_irqs on node -1
[   85.170621] alloc irq_2_iommu on node -1
[   85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
[   85.170626]   alloc irq_desc for 101 on node -1
[   85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
[   85.170630]   alloc kstat_irqs on node -1
[   85.170631] alloc irq_2_iommu on node -1
[   85.170635]   alloc irq_desc for 102 on node -1
[   85.170636]   alloc kstat_irqs on node -1
[   85.170639] alloc irq_2_iommu on node -1
[   85.170646] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000088

As you can see igb and ixgbe are both alternating on create_irq_nr()
via pci_enable_msix() in their probe function.

ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
NULL via dynamic_irq_init().

igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:

	cfg_new = irq_desc_ptrs[102]->chip_data;
	if (cfg_new->vector != 0)
		continue;

This hits the NULL deref.

Another possible race exists via pci_disable_msix() in a driver or in
the number of error paths that call free_msi_irqs():

destroy_irq()
dynamic_irq_cleanup() which sets desc->chip_data = NULL
...race window...
desc->chip_data = cfg;

Remove the save and restore code for cfg in create_irq_nr() and
destroy_irq() and take the desc->lock when checking the irq_cfg.

Reported-and-analyzed-by: Brandon Philips <bphilips@suse.de>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-3-git-send-email-yinghai@kernel.org>
Signed-off-by: Brandon Phililps <bphilips@suse.de>
Cc: stable@kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-10 14:27:28 -08:00
Steven Rostedt
ede55c9d78 tracing: Add correct/incorrect to sort keys for branch annotation output
The branch annotation is a bit difficult to see the worst offenders
because it only sorts by percentage:

 correct incorrect  %        Function                  File              Line
 ------- ---------  -        --------                  ----              ----
       0      163 100 qdisc_restart                  sch_generic.c        179
       0      163 100 pfifo_fast_dequeue             sch_generic.c        447
       0        4 100 pskb_trim_rcsum                skbuff.h             1689
       0        4 100 llc_rcv                        llc_input.c          170
       0       18 100 psmouse_interrupt              psmouse-base.c       304
       0        3 100 atkbd_interrupt                atkbd.c              389
       0        5 100 usb_alloc_dev                  usb.c                437
       0       11 100 vsscanf                        vsprintf.c           1897
       0        2 100 IS_ERR                         err.h                34
       0       23 100 __rmqueue_fallback             page_alloc.c         865
       0        4 100 probe_wakeup_sched_switch      trace_sched_wakeup.c 142
       0        3 100 move_masked_irq                migration.c          11

Adding the incorrect and correct values as sort keys makes this file a
bit more informative:

 correct incorrect  %        Function                  File              Line
 ------- ---------  -        --------                  ----              ----
       0   366541 100 audit_syscall_entry            auditsc.c            1637
       0   366538 100 audit_syscall_exit             auditsc.c            1685
       0   115839 100 sched_info_switch              sched_stats.h        269
       0    74567 100 sched_info_queued              sched_stats.h        222
       0    66578 100 sched_info_dequeued            sched_stats.h        177
       0    15113 100 trace_workqueue_insertion      workqueue.h          38
       0    15107 100 trace_workqueue_execution      workqueue.h          45
       0     3622 100 syscall_trace_leave            ptrace.c             1772
       0     2750 100 sched_move_task                sched.c              10100
       0     2750 100 sched_move_task                sched.c              10110
       0     1815 100 pre_schedule_rt                sched_rt.c           1462
       0      837 100 audit_alloc                    auditsc.c            879
       0      814 100 tcp_mss_split_point            tcp_output.c         1302

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-02-09 21:35:05 -05:00
Jason Wang
c93d89f3db Export the symbol of getboottime and mmonotonic_to_bootbased
Export getboottime and monotonic_to_bootbased in order to let them
could be used by following patch.

Cc: stable@kernel.org
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-02-09 19:20:15 +02:00
Anton Blanchard
301ba0457f kthread, sched: Remove reference to kthread_create_on_cpu
kthread_create_on_cpu doesn't exist so update a comment in
kthread.c to reflect this.

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100209040740.GB3702@kryten>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-09 11:47:39 +01:00
Anton Blanchard
b80109e256 Remove reference to kthread_create_on_cpu
kthread_create_on_cpu doesn't exist so update a comment in kthread.c to reflect
this.

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-09 11:28:48 +01:00
Al Viro
cccc6bba3f Lose the first argument of audit_inode_child()
it's always equal to ->d_name.name of the second argument

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-08 14:38:36 -05:00
Anton Blanchard
fa535a77bd sched: cpuacct: Use bigger percpu counter batch values for stats counters
When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are
enabled we can call cpuacct_update_stats with values much larger
than percpu_counter_batch.  This means the call to
percpu_counter_add will always add to the global count which is
protected by a spinlock and we end up with a global spinlock in
the scheduler.

Based on an idea by KOSAKI Motohiro, this patch scales the batch
value by cputime_one_jiffy such that we have the same batch
limit as we would if CONFIG_VIRT_CPU_ACCOUNTING was disabled.
His patch did this once at boot but that initialisation happened
too early on PowerPC (before time_init) and it was never updated
at runtime as a result of a hotplug cpu add/remove.

This patch instead scales percpu_counter_batch by
cputime_one_jiffy at runtime, which keeps the batch correct even
after cpu hotplug operations.  We cap it at INT_MAX in case of
overflow.

For architectures that do not support
CONFIG_VIRT_CPU_ACCOUNTING, cputime_one_jiffy is the constant 1
and gcc is smart enough to optimise min(s32
percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
least on x86 and PowerPC.  So there is no need to add an #ifdef.

On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark
is 234x faster and almost matches a CONFIG_CGROUP_CPUACCT
disabled kernel:

 CONFIG_CGROUP_CPUACCT disabled:   16906698 ctx switches/sec
 CONFIG_CGROUP_CPUACCT enabled:       61720 ctx switches/sec
 CONFIG_CGROUP_CPUACCT + patch:	   16663217 ctx switches/sec

Tested with:

 wget http://ozlabs.org/~anton/junkcode/context_switch.c
 make context_switch
 for i in `seq 0 63`; do taskset -c $i ./context_switch & done
 vmstat 1

Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-08 08:57:37 +01:00
Ingo Molnar
6d3e0907b8 Merge branch 'sched/urgent' into sched/core
Merge reason: Merge dependent fix, update to latest -rc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-08 08:55:46 +01:00
Andrew Morton
50200df462 kernel/sched.c: Suppress unused var warning
On UP:

 kernel/sched.c: In function 'wake_up_new_task':
 kernel/sched.c:2631: warning: unused variable 'cpu'

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-08 08:53:19 +01:00
Don Zickus
84e478c6f1 nmi_watchdog: Config option to enable new nmi_watchdog
These are the bits that enable the new nmi_watchdog and safely
isolate the old nmi_watchdog.  Only one or the other can run,
not both at the same time.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
Cc: peterz@infradead.org
LKML-Reference: <1265424425-31562-4-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-08 08:29:03 +01:00
Don Zickus
1fb9d6ad27 nmi_watchdog: Add new, generic implementation, using perf events
This is a new generic nmi_watchdog implementation using the perf
events infrastructure as suggested by Ingo.

The implementation is simple, just create an in-kernel perf
event and register an overflow handler to check for cpu lockups.

I created a generic implementation that lives in kernel/ and
the hardware specific part that for now lives in arch/x86.

This approach has a number of advantages:

 - It simplifies the x86 PMU implementation in the long run,
   in that it removes the hardcoded low-level PMU implementation
   that was the NMI watchdog before.

 - It allows new NMI watchdog features to be added in a central
   place.

 - It allows other architectures to enable the NMI watchdog,
   as long as they have perf events (that provide NMIs)
   implemented.

 - It also allows for more graceful co-existence of existing
   perf events apps and the NMI watchdog - before these changes
   the relationship was exclusive. (The NMI watchdog will 'spend'
   a perf event when enabled. In later iterations we might be
   able to piggyback from an existing NMI event without having
   to allocate a hardware event for the NMI watchdog - turning
   this into a no-hardware-cost feature.)

As for compatibility, we'll keep the old NMI watchdog code as
well until the new one can 100% replace it on all CPUs, old and
new alike.  That might take some time as the NMI watchdog has
been ported to many CPU models.

I have done light testing to make sure the framework works
correctly and it does.

 v2: Set the correct timeout values based on the old nmi
     watchdog

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: gorcunov@gmail.com
Cc: aris@redhat.com
Cc: peterz@infradead.org
LKML-Reference: <1265424425-31562-3-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-08 08:29:02 +01:00
H Hartley Sweeten
6622e670b2 posix-timers.c: Don't export local functions
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-05 14:54:10 +01:00
Magnus Damm
c54a42b19f clocksource: add suspend callback
Add a clocksource suspend callback.  This callback can be used by the
clocksource driver to shutdown and perform any kind of late suspend
activities even though the clocksource driver itself is a non-sysdev
driver.

One example where this is useful is to fix the sh_cmt.c platform driver
that today suspends using the platform bus and shuts down the clocksource
too early.

With this callback in place the sh_cmt driver will suspend using the
clocksource and clockevent hooks and leave the platform device pm
callbacks unused.

Signed-off-by: Magnus Damm <damm@opensource.se>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-05 14:54:10 +01:00
Magnus Damm
17622339af clocksource: add argument to resume callback
Pass the clocksource as an argument to the clocksource resume callback. 
Needed so we can point out which CMT channel the sh_cmt.c driver shall
resume.

Signed-off-by: Magnus Damm <damm@opensource.se>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-05 14:54:10 +01:00
Daniel Mack
1537a3638c tree-wide: fix 'lenght' typo in comments and code
Some misspelled occurences of 'octet' and some comments were also fixed
as I was on it.

Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Jiri Kosina <trivial@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-05 12:22:45 +01:00
Baruch Siach
9ce8e498ee devres: typo fix s/dev/devm/
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-05 12:22:43 +01:00
Edward Z. Yang
350f82586b Remove redundant trailing semicolons from macros
Signed-off-by: Edward Z. Yang <ezyang@ksplice.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-05 12:22:43 +01:00
Thadeu Lima de Souza Cascardo
af66585270 fix comment typo boo -> boot in ksysfs.c
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-05 12:22:37 +01:00
Adam Buchbinder
c9404c9c39 Fix misspelling of "should" and "shouldn't" in comments.
Some comments misspell "should" or "shouldn't"; this fixes them. No code changes.

Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-05 12:22:30 +01:00
Masami Hiramatsu
5ecaafdbf4 kprobes: Add mcount to the kprobes blacklist
Since mcount function can be called from everywhere,
it should be blacklisted. Moreover, the "mcount" symbol
is a special symbol name. So, it is better to put it in
the generic blacklist.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100205062433.3745.36726.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-05 08:13:57 +01:00
Linus Torvalds
aa16cd8d12 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futex: Handle futex value corruption gracefully
  futex: Handle user space corruption gracefully
  futex_lock_pi() key refcnt fix
  softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume
2010-02-04 16:07:41 -08:00
Adam Buchbinder
2a61aa4016 Fix misspellings of "invocation" in comments.
Some comments misspell "invocation"; this fixes them. No code
changes.

Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-04 11:55:45 +01:00
Adam Buchbinder
c41b20e721 Fix misspellings of "truly" in comments.
Some comments misspell "truly"; this fixes them. No code changes.

Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-04 11:55:45 +01:00
Peter Zijlstra
9717e6cd3d perf_events: Optimize perf_event_task_tick()
Pretty much all of the calls do perf_disable/perf_enable cycles, pull
that out to cut back on hardware programming.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-04 09:59:49 +01:00
Yong Zhang
2357725695 sched: Remove member rt_se from struct rt_rq
It's a duplicate of tg->rt_se[cpu] and the only usage is
sched_rt_rq_dequeue() and sched_rt_rq_enqueue(). After the
first patch to those two function. rt_se can be removed.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <2674af741001282258q38781619u653ca4a7dd267347@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-04 09:57:33 +01:00
Yong Zhang
74b7eb5885 sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu]
This is the first step to remove rt_rq member rt_se because it have the
same meaning with tg->rt_se[cpu]. And the latter style is also used by
the fair scheduling class.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <2674af741001282257r28c97a92o9f90cf16fe8d3d84@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-04 09:57:32 +01:00
Masami Hiramatsu
f24bb999d2 ftrace: Remove record freezing
Remove record freezing. Because kprobes never puts probe on
ftrace's mcount call anymore, it doesn't need ftrace to check
whether kprobes on it.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: przemyslaw@pawelczyk.it
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100202214925.4694.73469.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-04 09:36:19 +01:00
Masami Hiramatsu
4554dbcb85 kprobes: Check probe address is reserved
Check whether the address of new probe is already reserved by
ftrace or alternatives (on x86) when registering new probe.
If reserved, it returns an error and not register the probe.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: przemyslaw@pawelczyk.it
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Jason Baron <jbaron@redhat.com>
LKML-Reference: <20100202214918.4694.94179.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-04 09:36:19 +01:00
Masami Hiramatsu
2cfa19780d ftrace/alternatives: Introducing *_text_reserved functions
Introducing *_text_reserved functions for checking the text
address range is partially reserved or not. This patch provides
checking routines for x86 smp alternatives and dynamic ftrace.
Since both functions modify fixed pieces of kernel text, they
should reserve and protect those from other dynamic text
modifier, like kprobes.

This will also be extended when introducing other subsystems
which modify fixed pieces of kernel text. Dynamic text modifiers
should avoid those.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: przemyslaw@pawelczyk.it
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Jason Baron <jbaron@redhat.com>
LKML-Reference: <20100202214911.4694.16587.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-04 09:36:19 +01:00
Masami Hiramatsu
615d0ebbc7 kprobes: Disable booster when CONFIG_PREEMPT=y
Disable kprobe booster when CONFIG_PREEMPT=y at this time,
because it can't ensure that all kernel threads preempted on
kprobe's boosted slot run out from the slot even using
freeze_processes().

The booster on preemptive kernel will be resumed if
synchronize_tasks() or something like that is introduced.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100202214904.4694.24330.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-04 09:36:18 +01:00
Kees Cook
d78ca3cd73 syslog: use defined constants instead of raw numbers
Right now the syslog "type" action are just raw numbers which makes
the source difficult to follow.  This patch replaces the raw numbers
with defined constants for some level of sanity.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-02-04 14:20:41 +11:00
Kees Cook
002345925e syslog: distinguish between /proc/kmsg and syscalls
This allows the LSM to distinguish between syslog functions originating
from /proc/kmsg access and direct syscalls.  By default, the commoncaps
will now no longer require CAP_SYS_ADMIN to read an opened /proc/kmsg
file descriptor.  For example the kernel syslog reader can now drop
privileges after opening /proc/kmsg, instead of staying privileged with
CAP_SYS_ADMIN.  MAC systems that implement security_syslog have unchanged
behavior.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-02-04 14:20:12 +11:00
Mahesh Salgaonkar
cd757645fb perf: Make bp_len type to u64 generic across the arch
Change 'bp_len' type to __u64 to make it work across archs as
the s390 architecture watch point length can be upto 2^64.

reference:
	http://lkml.org/lkml/2010/1/25/212

This is an ABI change that is not backward compatible with
the previous hardware breakpoint info layout integrated in this
development cycle, a rebuilt of perf tools is necessary for
versions based on 2.6.33-rc1 - 2.6.33-rc6 to work with a
kernel based on this patch.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>
Cc: Maneesh Soni <maneesh@in.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin <schwidefsky@de.ibm.com>
LKML-Reference: <20100130045518.GA20776@in.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-02-04 01:07:12 +01:00
Peter Zijlstra
b9c3032277 hrtimer, softirq: Fix hrtimer->softirq trampoline
hrtimers callbacks are always done from hardirq context, either the
jiffy tick interrupt or the hrtimer device interrupt.

[ there is currently one exception that can still call a hrtimer
  callback from softirq, but even in that case this will still
  work correctly. ]

Reported-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yury Polyanskiy <ypolyans@princeton.edu>
Tested-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
LKML-Reference: <1265120401.24455.306.camel@laptop>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-03 18:17:40 +01:00
Thomas Gleixner
59647b6ac3 futex: Handle futex value corruption gracefully
The WARN_ON in lookup_pi_state which complains about a mismatch
between pi_state->owner->pid and the pid which we retrieved from the
user space futex is completely bogus.

The code just emits the warning and then continues despite the fact
that it detected an inconsistent state of the futex. A conveniant way
for user space to spam the syslog.

Replace the WARN_ON by a consistency check. If the values do not match
return -EINVAL and let user space deal with the mess it created.

This also fixes the missing task_pid_vnr() when we compare the
pi_state->owner pid with the futex value.

Reported-by: Jermome Marchand <jmarchan@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
2010-02-03 15:13:22 +01:00
Thomas Gleixner
51246bfd18 futex: Handle user space corruption gracefully
If the owner of a PI futex dies we fix up the pi_state and set
pi_state->owner to NULL. When a malicious or just sloppy programmed
user space application sets the futex value to 0 e.g. by calling
pthread_mutex_init(), then the futex can be acquired again. A new
waiter manages to enqueue itself on the pi_state w/o damage, but on
unlock the kernel dereferences pi_state->owner and oopses.

Prevent this by checking pi_state->owner in the unlock path. If
pi_state->owner is not current we know that user space manipulated the
futex value. Ignore the mess and return -EINVAL.

This catches the above case and also the case where a task hijacks the
futex by setting the tid value and then tries to unlock it.

Reported-by: Jermome Marchand <jmarchan@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
2010-02-03 15:13:22 +01:00
Mikael Pettersson
5ecb01cfdf futex_lock_pi() key refcnt fix
This fixes a futex key reference count bug in futex_lock_pi(),
where a key's reference count is incremented twice but decremented
only once, causing the backing object to not be released.

If the futex is created in a temporary file in an ext3 file system,
this bug causes the file's inode to become an "undead" orphan,
which causes an oops from a BUG_ON() in ext3_put_super() when the
file system is unmounted. glibc's test suite is known to trigger this,
see <http://bugzilla.kernel.org/show_bug.cgi?id=14256>.

The bug is a regression from 2.6.28-git3, namely Peter Zijlstra's
38d47c1b70 "[PATCH] futex: rely on
get_user_pages() for shared futexes". That commit made get_futex_key()
also increment the reference count of the futex key, and updated its
callers to decrement the key's reference count before returning.
Unfortunately the normal exit path in futex_lock_pi() wasn't corrected:
the reference count is incremented by get_futex_key() and queue_lock(),
but the normal exit path only decrements once, via unqueue_me_pi().
The fix is to put_futex_key() after unqueue_me_pi(), since 2.6.31
this is easily done by 'goto out_put_key' rather than 'goto out'.

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
2010-02-03 15:13:22 +01:00
Linus Torvalds
c80d292f13 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
  kernel/cred.c: use kmem_cache_free
2010-02-02 18:12:22 -08:00
Li Zefan
4528fd0595 cgroups: fix to return errno in a failure path
In cgroup_create(), if alloc_css_id() returns failure, the errno is not
propagated to userspace, so mkdir will fail silently.

To trigger this bug, we mount blkio (or memory subsystem), and create more
then 65534 cgroups.  (The number of cgroups is limited to 65535 if a
subsystem has use_id == 1)

 # mount -t cgroup -o blkio xxx /mnt
 # for ((i = 0; i < 65534; i++)); do mkdir /mnt/$i; done
 # mkdir /mnt/65534
 (should return ENOSPC)
 #

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02 18:11:22 -08:00
Randy Dunlap
bc173f7092 kfifo: fix kernel-doc notation
Fix kfifo kernel-doc warnings:

Warning(kernel/kfifo.c:361): No description found for parameter 'total'
Warning(kernel/kfifo.c:402): bad line:  @ @lenout: pointer to output variable with copied data
Warning(kernel/kfifo.c:412): No description found for parameter 'lenout'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02 18:11:21 -08:00
Julia Lawall
b8a1d37c5f kernel/cred.c: use kmem_cache_free
Free memory allocated using kmem_cache_zalloc using kmem_cache_free rather
than kfree.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression x,E,c;
@@

 x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
 ... when != x = E
     when != &x
?-kfree(x)
+kmem_cache_free(c,x)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Steve Dickson <steved@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: James Morris <jmorris@namei.org>
2010-02-03 10:21:57 +11:00
Lai Jiangshan
4f48f8b7fd tracing: Fix circular dead lock in stack trace
When we cat <debugfs>/tracing/stack_trace, we may cause circular lock:
sys_read()
  t_start()
     arch_spin_lock(&max_stack_lock);

  t_show()
     seq_printf(), vsnprintf() .... /* they are all trace-able,
       when they are traced, max_stack_lock may be required again. */

The following script can trigger this circular dead lock very easy:
#!/bin/bash

echo 1 > /proc/sys/kernel/stack_tracer_enabled

mount -t debugfs xxx /mnt > /dev/null 2>&1

(
# make check_stack() zealous to require max_stack_lock
for ((; ;))
{
	echo 1 > /mnt/tracing/stack_max_size
}
) &

for ((; ;))
{
	cat /mnt/tracing/stack_trace > /dev/null
}

To fix this bug, we increase the percpu trace_active before
require the lock.

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B67D4F9.9080905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-02-02 10:20:18 -05:00
Peter Zijlstra
4a461c85b6 sched: Remove unused update_shares_locked()
Commit f492e12ef0 ("sched: Remove
load_balance_newidle()") removed the only user of this function,
so remove it too.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1265019219.24455.128.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-02 06:58:27 +01:00
Akinobu Mita
90fdbdb484 sched: Use for_each_bit
No change in functionality.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <1264938810-4173-1-git-send-email-akinobu.mita@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-02 06:58:27 +01:00
Tejun Heo
ab386128f2 Merge branch 'master' into percpu 2010-02-02 14:38:15 +09:00
Andrew Morton
e527300715 Generic page_is_ram: use __weak
Use __weak instead of __attribute__((weak)).

Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-01 16:58:17 -08:00
Wu Fengguang
61ef2489db resources: introduce generic page_is_ram()
It's based on walk_system_ram_range(), for archs that don't have
their own page_is_ram().

The static verions in MIPS and SCORE are also made global.

v4: prefer plain 1 instead of PAGE_IS_RAM (H. Peter Anvin)
v3: add comment (KAMEZAWA Hiroyuki)
    "AFAIK, this "System RAM" information has been used for kdump to
    grab valid memory area and seems good for the kernel itself."
v2: add PAGE_IS_RAM macro (Américo Wang)

Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Cc: linux-mips@linux-mips.org
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
LKML-Reference: <20100122081619.GA6431@localhost>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-01 16:58:17 -08:00
Linus Torvalds
e20da89130 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  lockdep: Fix check_usage_backwards() error message
2010-02-01 10:45:26 -08:00
Linus Torvalds
834db333ed Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf, hw_breakpoint, kgdb: Do not take mutex for kernel debugger
  x86, hw_breakpoints, kgdb: Fix kgdb to use hw_breakpoint API
  hw_breakpoints: Release the bp slot if arch_validate_hwbkpt_settings() fails.
  perf: Ignore perf.data.old
  perf report: Fix segmentation fault when running with '-g none'
2010-02-01 10:45:00 -08:00
Linus Torvalds
8ea85c2817 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Correct printk whitespace in warning from cpu down task check
  sched: Fix incorrect sanity check
  sched: Fix fork vs hotplug vs cpuset namespaces
2010-02-01 10:44:36 -08:00
Linus Torvalds
bdd8466783 Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  clocksource: Prevent potential kgdb dead lock
2010-02-01 10:44:06 -08:00
Jason Wessel
d6ad3e286d softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume
When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, sched_clock() gets
the time from hardware such as the TSC on x86. In this
configuration kgdb will report a softlock warning message on
resuming or detaching from a debug session.

Sequence of events in the problem case:

 1) "cpu sched clock" and "hardware time" are at 100 sec prior
    to a call to kgdb_handle_exception()

 2) Debugger waits in kgdb_handle_exception() for 80 sec and on
    exit the following is called ...  touch_softlockup_watchdog() -->
    __raw_get_cpu_var(touch_timestamp) = 0;

 3) "cpu sched clock" = 100s (it was not updated, because the
    interrupt was disabled in kgdb) but the "hardware time" = 180 sec

 4) The first timer interrupt after resuming from
    kgdb_handle_exception updates the watchdog from the "cpu sched clock"

update_process_times() { ...  run_local_timers() -->
softlockup_tick() --> check (touch_timestamp == 0) (it is "YES"
here, we have set "touch_timestamp = 0" at kgdb) -->
__touch_softlockup_watchdog() ***(A)--> reset "touch_timestamp"
to "get_timestamp()" (Here, the "touch_timestamp" will still be
set to 100s.)  ...

    scheduler_tick() ***(B)--> sched_clock_tick() (update "cpu sched
    clock" to "hardware time" = 180s) ...  }

 5) The Second timer interrupt handler appears to have a large
    jump and trips the softlockup warning.

update_process_times() { ...  run_local_timers() -->
softlockup_tick() --> "cpu sched clock" - "touch_timestamp" =
180s-100s > 60s --> printk "soft lockup error messages" ...  }

note: ***(A) reset "touch_timestamp" to
"get_timestamp(this_cpu)"

Why is "touch_timestamp" 100 sec, instead of 180 sec?

When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, the call trace of
get_timestamp() is:

get_timestamp(this_cpu)
 -->cpu_clock(this_cpu)
 -->sched_clock_cpu(this_cpu)
 -->__update_sched_clock(sched_clock_data, now)

The __update_sched_clock() function uses the GTOD tick value to
create a window to normalize the "now" values.  So if "now"
value is too big for sched_clock_data, it will be ignored.

The fix is to invoke sched_clock_tick() to update "cpu sched
clock" in order to recover from this state.  This is done by
introducing the function touch_softlockup_watchdog_sync(). This
allows kgdb to request that the sched clock is updated when the
watchdog thread runs the first time after a resume from kgdb.

[yong.zhang0@gmail.com: Use per cpu instead of an array]
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com>
Cc: kgdb-bugreport@lists.sourceforge.net
Cc: peterz@infradead.org
LKML-Reference: <1264631124-4837-2-git-send-email-jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-01 08:22:32 +01:00
Jason Wessel
5352ae638e perf, hw_breakpoint, kgdb: Do not take mutex for kernel debugger
This patch fixes the regression in functionality where the
kernel debugger and the perf API do not nicely share hw
breakpoint reservations.

The kernel debugger cannot use any mutex_lock() calls because it
can start the kernel running from an invalid context.

A mutex free version of the reservation API needed to get
created for the kernel debugger to safely update hw breakpoint
reservations.

The possibility for a breakpoint reservation to be concurrently
processed at the time that kgdb interrupts the system is
improbable. Should this corner case occur the end user is
warned, and the kernel debugger will prohibit updating the
hardware breakpoint reservations.

Any time the kernel debugger reserves a hardware breakpoint it
will be a system wide reservation.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: kgdb-bugreport@lists.sourceforge.net
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: torvalds@linux-foundation.org
LKML-Reference: <1264719883-7285-3-git-send-email-jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-30 08:42:21 +01:00
Jason Wessel
cc0967490c x86, hw_breakpoints, kgdb: Fix kgdb to use hw_breakpoint API
In the 2.6.33 kernel, the hw_breakpoint API is now used for the
performance event counters.  The hw_breakpoint_handler() now
consumes the hw breakpoints that were previously set by kgdb
arch specific code.  In order for kgdb to work in conjunction
with this core API change, kgdb must use some of the low level
functions of the hw_breakpoint API to install, uninstall, and
deal with hw breakpoint reservations.

The kgdb core required a change to call kgdb_disable_hw_debug
anytime a slave cpu enters kgdb_wait() in order to keep all the
hw breakpoints in sync as well as to prevent hitting a hw
breakpoint while kgdb is active.

During the architecture specific initialization of kgdb, it will
pre-allocate 4 disabled (struct perf event **) structures.  Kgdb
will use these to manage the capabilities for the 4 hw
breakpoint registers, per cpu.  Right now the hw_breakpoint API
does not have a way to ask how many breakpoints are available,
on each CPU so it is possible that the install of a breakpoint
might fail when kgdb restores the system to the run state.  The
intent of this patch is to first get the basic functionality of
hw breakpoints working and leave it to the person debugging the
kernel to understand what hw breakpoints are in use and what
restrictions have been imposed as a result.  Breakpoint
constraints will be dealt with in a future patch.

While atomic, the x86 specific kgdb code will call
arch_uninstall_hw_breakpoint() and arch_install_hw_breakpoint()
to manage the cpu specific hw breakpoints.

The net result of these changes allow kgdb to use the same pool
of hw_breakpoints that are used by the perf event API, but
neither knows about future reservations for the available hw
breakpoint slots.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: kgdb-bugreport@lists.sourceforge.net
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: torvalds@linux-foundation.org
LKML-Reference: <1264719883-7285-2-git-send-email-jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-30 08:42:20 +01:00
Ingo Molnar
ae7f6711d6 Merge branch 'perf/urgent' into perf/core
Merge reason: We want to queue up a dependent patch. Also update to
              later -rc's.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-29 10:36:22 +01:00
John Stultz
7e1b584774 ntp: Cleanup xtime references in ntp.c
ntp.c doesn't need to access timekeeping internals directly, so change
xtime references to use the get_seconds() timekeeping interface.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: richard@rsk.demon.co.uk
LKML-Reference: <1264738844-21935-1-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-01-29 10:15:19 +01:00
john stultz
1f5b8f8a20 ntp: Make time_esterror and time_maxerror static
Make time_esterror and time_maxerror static as no one uses them
outside of ntp.c
    
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: richard@rsk.demon.co.uk
LKML-Reference: <1264719761.3437.47.camel@localhost.localdomain>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-01-29 10:15:19 +01:00
Peter Zijlstra
75c9f3284a perf_events: Fix sample_period transfer on inherit
One problem with frequency driven counters is that we cannot
predict the rate at which they trigger, therefore we have to
start them at period=1, this causes a ramp up effect. However,
if we fail to propagate the stable state on fork each new child
will have to ramp up again. This can lead to significant
artifacts in sample data.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: eranian@google.com
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1264752266.4283.2121.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-29 09:15:26 +01:00
Xiao Guangrong
1e12a4a7a3 tracing/kprobe: Cleanup unused return value of tracing functions
The return values of the kprobe's tracing functions are meaningless,
lets remove these.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <4B60E9A3.2040505@cn.fujitsu.com>
[fweisbec@gmail: whitespace fixes, drop useless void returns in end
of functions]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-01-29 02:14:40 +01:00
Xiao Guangrong
430ad5a600 perf: Factorize trace events raw sample buffer operations
Introduce ftrace_perf_buf_prepare() and ftrace_perf_buf_submit() to
gather the common code that operates on raw events sampling buffer.
This cleans up redundant code between regular trace events, syscall
events and kprobe events.

Changelog v1->v2:
- Rename function name as per Masami and Frederic's suggestion
- Add __kprobes for ftrace_perf_buf_prepare() and make
  ftrace_perf_buf_submit() inline as per Masami's suggestion
- Export ftrace_perf_buf_prepare since modules will use it

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <4B60E92D.9000808@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-01-29 02:02:57 +01:00
Lai Jiangshan
ea2c68a08f tracing: Simplify test for function_graph tracing start point
In the function graph tracer, a calling function is to be traced
only when it is enabled through the set_graph_function file,
or when it is nested in an enabled function.

Current code uses TSK_TRACE_FL_GRAPH to test whether it is nested
or not. Looking at the code, we can get this:
(trace->depth > 0) <==> (TSK_TRACE_FL_GRAPH is set)

trace->depth is more explicit to tell that it is nested.
So we use trace->depth directly and simplify the code.

No functionality is changed.
TSK_TRACE_FL_GRAPH is not removed yet, it is left for future usage.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B4DB0B6.7040607@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-01-29 01:05:12 +01:00
Mahesh Salgaonkar
b23ff0e933 hw_breakpoints: Release the bp slot if arch_validate_hwbkpt_settings() fails.
On a given architecture, when hardware breakpoint registration fails
due to un-supported access type (read/write/execute), we lose the bp
slot since register_perf_hw_breakpoint() does not release the bp slot
on failure.
Hence, any subsequent hardware breakpoint registration starts failing
with 'no space left on device' error.

This patch introduces error handling in register_perf_hw_breakpoint()
function and releases bp slot on error.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: K. Prasad <prasad@linux.vnet.ibm.com>
Cc: Maneesh Soni <maneesh@in.ibm.com>
LKML-Reference: <20100121125516.GA32521@in.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-01-28 14:15:51 +01:00
Frans Pop
9d3cfc4c1d sched: Correct printk whitespace in warning from cpu down task check
Due to an incorrect line break the output currently contains tabs.
Also remove trailing space.

The actual output that logcheck sent me looked like this:
 Task events/1 (pid = 10) is on cpu 1^I^I^I^I(state = 1, flags = 84208040)

After this patch it becomes:
 Task events/1 (pid = 10) is on cpu 1 (state = 1, flags = 84208040)

Signed-off-by: Frans Pop <elendilplanet.nl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <201001251456.34996.elendil@planet.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-28 06:59:55 +01:00
Peter Zijlstra
11854247e2 sched: Fix incorrect sanity check
We moved to migrate on wakeup, which means that sleeping tasks could
still be present on offline cpus. Amend the check to only test running
tasks.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-28 06:59:51 +01:00
Peter Zijlstra
abd5071394 perf: Reimplement frequency driven sampling
There was a bug in the old period code that caused intel_pmu_enable_all()
or native_write_msr_safe() to show up quite high in the profiles.

In staring at that code it made my head hurt, so I rewrote it in a
hopefully simpler fashion. Its now fully symetric between tick and
overflow driven adjustments and uses less data to boot.

The only complication is that it basically wants to do a u128 division.
The code approximates that in a rather simple truncate until it fits
fashion, taking care to balance the terms while truncating.

This version does not generate that sampling artefact.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-27 08:39:33 +01:00
Oleg Nesterov
48d5067417 lockdep: Fix check_usage_backwards() error message
Lockdep has found the real bug, but the output doesn't look right to me:

> =========================================================
> [ INFO: possible irq lock inversion dependency detected ]
> 2.6.33-rc5 #77
> ---------------------------------------------------------
> emacs/1609 just changed the state of lock:
>  (&(&tty->ctrl_lock)->rlock){+.....}, at: [<ffffffff8127c648>] tty_fasync+0xe8/0x190
> but this lock took another, HARDIRQ-unsafe lock in the past:
>  (&(&sighand->siglock)->rlock){-.....}

"HARDIRQ-unsafe" and "this lock took another" looks wrong, afaics.

>   ... key      at: [<ffffffff81c054a4>] __key.46539+0x0/0x8
>   ... acquired at:
>    [<ffffffff81089af6>] __lock_acquire+0x1056/0x15a0
>    [<ffffffff8108a0df>] lock_acquire+0x9f/0x120
>    [<ffffffff81423012>] _raw_spin_lock_irqsave+0x52/0x90
>    [<ffffffff8127c1be>] __proc_set_tty+0x3e/0x150
>    [<ffffffff8127e01d>] tty_open+0x51d/0x5e0

The stack-trace shows that this lock (ctrl_lock) was taken under
->siglock (which is hopefully irq-safe).

This is a clear typo in check_usage_backwards() where we tell the print a
fancy routine we're forwards.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100126181641.GA10460@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-27 08:34:02 +01:00
Mike Frysinger
0368897034 tracing/documentation: Cover new frame pointer semantics
Update the graph tracer examples to cover the new frame pointer semantics
(in terms of passing it along).  Move the HAVE_FUNCTION_GRAPH_FP_TEST docs
out of the Kconfig, into the right place, and expand on the details.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
LKML-Reference: <1264165967-18938-1-git-send-email-vapier@gentoo.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-26 17:00:39 -05:00
Steven Rostedt
3c05d74827 ring-buffer: Check for end of page in iterator
If the iterator comes to an empty page for some reason, or if
the page is emptied by a consuming read. The iterator code currently
does not check if the iterator is pass the contents, and may
return a false entry.

This patch adds a check to the ring buffer iterator to test if the
current page has been completely read and sets the iterator to the
next page if necessary.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-26 16:14:08 -05:00
Steven Rostedt
492a74f421 ring-buffer: Check if ring buffer iterator has stale data
Usually reads of the ring buffer is performed by a single task.
There are two types of reads from the ring buffer.

One is a consuming read which will consume the entry that was read
and the next read will be the entry that follows.

The other is an iterator that will let the user read the contents of
the ring buffer without modifying it. When an iterator is allocated,
writes to the ring buffer are disabled to protect the iterator.

The problem exists when consuming reads happen while an iterator is
allocated. Specifically, the kind of read that swaps out an entire
page (used by splice) and replaces it with a new read. If the iterator
is on the page that is swapped out, then the next read may read
from this swapped out page and return garbage.

This patch adds a check when reading the iterator to make sure that
the iterator contents are still valid. If a consuming read has taken
place, the iterator is reset.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-26 16:09:30 -05:00
Thomas Gleixner
7b7422a566 clocksource: Prevent potential kgdb dead lock
commit 0f8e8ef7 (clocksource: Simplify clocksource watchdog resume
logic) introduced a potential kgdb dead lock. When the kernel is
stopped by kgdb inside code which holds watchdog_lock then kgdb dead
locks in clocksource_resume_watchdog().

clocksource_resume_watchdog() is called from kbdg via
clocksource_touch_watchdog() to avoid that the clock source watchdog
marks TSC unstable after the kernel has been stopped.

Solve this by replacing spin_lock with a spin_trylock and just return
in case the lock is held. Not resetting the watchdog might result in
TSC becoming marked unstable, but that's an acceptable penalty for
using kgdb.

The timekeeping is anyway easily screwed up by kgdb when the system
uses either jiffies or a clock source which wraps in short intervals
(e.g. pm_timer wraps about every 4.6s), so we really do not have to
worry about that occasional TSC marked unstable side effect.

The second caller of clocksource_resume_watchdog() is
clocksource_resume(). The trylock is safe here as well because the
system is UP at this point, interrupts are disabled and nothing else
can hold watchdog_lock().

Reported-by: Jason Wessel <jason.wessel@windriver.com>
LKML-Reference: <1264480000-6997-4-git-send-email-jason.wessel@windriver.com>
Cc: kgdb-bugreport@lists.sourceforge.net
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-01-26 14:53:16 +01:00
Steven Rostedt
74bf4076f2 tracing: Prevent kernel oops with corrupted buffer
If the contents of the ftrace ring buffer gets corrupted and the trace
file is read, it could create a kernel oops (usualy just killing the user
task thread). This is caused by the checking of the pid in the buffer.
If the pid is negative, it still references the cmdline cache array,
which could point to an invalid address.

The simple fix is to test for negative PIDs.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-25 15:11:53 -05:00
Linus Torvalds
f6760aa024 Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  clockevent: Don't remove broadcast device when cpu is dead
2010-01-24 10:38:07 -08:00
Linus Torvalds
b8be634e01 Merge git://git.infradead.org/~dwmw2/mtd-2.6.33
* git://git.infradead.org/~dwmw2/mtd-2.6.33:
  mtd: tests: fix read, speed and stress tests on NOR flash
  mtd: Really add ARM pismo support
  kmsg_dump: Dump on crash_kexec as well
2010-01-24 10:31:34 -08:00
Thomas Gleixner
60db48cacb sched: Queue a deboosted task to the head of the RT prio queue
rtmutex_set_prio() is used to implement priority inheritance for
futexes. When a task is deboosted it gets enqueued at the tail of its
RT priority list. This is violating the POSIX scheduling semantics:

rt priority list X contains two runnable tasks A and B

task A	 runs with priority X and holds mutex M
task C	 preempts A and is blocked on mutex M 
     	 -> task A is boosted to priority of task C (Y)
task A	 unlocks the mutex M and deboosts itself
     	 -> A is dequeued from rt priority list Y
	 -> A is enqueued to the tail of rt priority list X
task C	 schedules away
task B	 runs

This is wrong as task A did not schedule away and therefor violates
the POSIX scheduling semantics.

Enqueue the task to the head of the priority list instead. 

Reported-by: Mathias Weber <mathias.weber.mw1@roche.com>
Reported-by: Carsten Emde <cbe@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Carsten Emde <cbe@osadl.org>
Tested-by: Mathias Weber <mathias.weber.mw1@roche.com>
LKML-Reference: <20100120171629.809074113@linutronix.de>
2010-01-22 18:09:59 +01:00
Thomas Gleixner
37dad3fce9 sched: Implement head queueing for sched_rt
The ability of enqueueing a task to the head of a SCHED_FIFO priority
list is required to fix some violations of POSIX scheduling policy.

Implement the functionality in sched_rt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Carsten Emde <cbe@osadl.org>
Tested-by: Mathias Weber <mathias.weber.mw1@roche.com>
LKML-Reference: <20100120171629.772169931@linutronix.de>
2010-01-22 18:09:59 +01:00
Thomas Gleixner
ea87bb7853 sched: Extend enqueue_task to allow head queueing
The ability of enqueueing a task to the head of a SCHED_FIFO priority
list is required to fix some violations of POSIX scheduling policy.

Extend the related functions with a "head" argument.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Carsten Emde <cbe@osadl.org>
Tested-by: Mathias Weber <mathias.weber.mw1@roche.com>
LKML-Reference: <20100120171629.734886007@linutronix.de>
2010-01-22 18:09:59 +01:00
Peter Zijlstra
fabf318e5e sched: Fix fork vs hotplug vs cpuset namespaces
There are a number of issues:

1) TASK_WAKING vs cgroup_clone (cpusets)

copy_process():

  sched_fork()
    child->state = TASK_WAKING; /* waiting for wake_up_new_task() */
  if (current->nsproxy != p->nsproxy)
     ns_cgroup_clone()
       cgroup_clone()
         mutex_lock(inode->i_mutex)
         mutex_lock(cgroup_mutex)
         cgroup_attach_task()
	   ss->can_attach()
           ss->attach() [ -> cpuset_attach() ]
             cpuset_attach_task()
               set_cpus_allowed_ptr();
                 while (child->state == TASK_WAKING)
                   cpu_relax();
will deadlock the system.


2) cgroup_clone (cpusets) vs copy_process

So even if the above would work we still have:

copy_process():

  if (current->nsproxy != p->nsproxy)
     ns_cgroup_clone()
       cgroup_clone()
         mutex_lock(inode->i_mutex)
         mutex_lock(cgroup_mutex)
         cgroup_attach_task()
	   ss->can_attach()
           ss->attach() [ -> cpuset_attach() ]
             cpuset_attach_task()
               set_cpus_allowed_ptr();
  ...

  p->cpus_allowed = current->cpus_allowed

over-writing the modified cpus_allowed.


3) fork() vs hotplug

  if we unplug the child's cpu after the sanity check when the child
  gets attached to the task_list but before wake_up_new_task() shit
  will meet with fan.

Solve all these issues by moving fork cpu selection into
wake_up_new_task().

Reported-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1264106190.4283.1314.camel@laptop>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-01-21 23:25:31 +01:00
Linus Torvalds
e80b135985 Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf: x86: Add support for the ANY bit
  perf: Change the is_software_event() definition
  perf: Honour event state for aux stream data
  perf: Fix perf_event_do_pending() fallback callsite
  perf kmem: Print usage help for unknown commands
  perf kmem: Increase "Hit" column length
  hw-breakpoints, perf: Fix broken mmiotrace due to dr6 by reference change
  perf timechart: Use tid not pid for COMM change
2010-01-21 08:50:04 -08:00
Peter Zijlstra
22e190851f perf: Honour event state for aux stream data
Anton reported that perf record kept receiving events even after calling
ioctl(PERF_EVENT_IOC_DISABLE). It turns out that FORK,COMM and MMAP
events didn't respect the disabled state and kept flowing in.

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Anton Blanchard <anton@samba.org>
LKML-Reference: <1263459187.4244.265.camel@laptop>
CC: stable@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:40 +01:00
Peter Zijlstra
fe432200ab perf: Fix perf_event_do_pending() fallback callsite
Paul questioned the context in which we should call
perf_event_do_pending(). After looking at that I found that it should be
called from IRQ context these days, however the fallback call-site is
placed in softirq context. Ammend this by placing the callback in the IRQ
timer path.

Reported-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1263374859.4244.192.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:39 +01:00
Dhaval Giani
7c9414385e sched: Remove USER_SCHED
Remove the USER_SCHED feature. It has been scheduled to be removed in
2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2

Signed-off-by: Dhaval Giani <dhaval.giani@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1263990378.24844.3.camel@localhost>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:18 +01:00
Gautham R Shenoy
871e35bc97 sched: Fix the place where group powers are updated
We want to update the sched_group_powers when balance_cpu == this_cpu.

Currently the group powers are updated only if the balance_cpu is the
first CPU in the local group. But balance_cpu = this_cpu could also be
the first idle cpu in the group. Hence fix the place where the group
powers are updated.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1264017764.5717.127.camel@jschopp-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:17 +01:00
Peter Zijlstra
8f190fb3f7 sched: Assume *balance is valid
Since all load_balance() callers will have !NULL balance parameters we
can now assume so and remove a few checks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:15 +01:00
Peter Zijlstra
f492e12ef0 sched: Remove load_balance_newidle()
The two functions: load_balance{,_newidle}() are very similar, with the
following differences:

 - rq->lock usage
 - sb->balance_interval updates
 - *balance check

So remove the load_balance_newidle() call with load_balance(.idle =
CPU_NEWLY_IDLE), explicitly unlock the rq->lock before calling (would be
done by double_lock_balance() anyway), and ignore the other differences
for now.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:14 +01:00
Peter Zijlstra
1af3ed3ddf sched: Unify load_balance{,_newidle}()
load_balance() and load_balance_newidle() look remarkably similar, one
key point they differ in is the condition on when to active balance.

So split out that logic into a separate function.

One side effect is that previously load_balance_newidle() used to fail
and return -1 under these conditions, whereas now it doesn't. I've not
yet fully figured out the whole -1 return case for either
load_balance{,_newidle}().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:13 +01:00
Peter Zijlstra
baa8c1102f sched: Add a lock break for PREEMPT=y
Since load-balancing can hold rq->locks for quite a long while, allow
breaking out early when there is lock contention.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:13 +01:00
Peter Zijlstra
230059de77 sched: Remove from fwd decls
Move code around to get rid of fwd declarations.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:12 +01:00
Peter Zijlstra
897c395f4c sched: Remove rq_iterator from move_one_task
Again, since we only iterate the fair class, remove the abstraction.

Since this is the last user of the rq_iterator, remove all that too.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:11 +01:00
Peter Zijlstra
ee00e66fff sched: Remove rq_iterator usage from load_balance_fair
Since we only ever iterate the fair class, do away with this abstraction.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:10 +01:00
Peter Zijlstra
3d45fd804a sched: Remove the sched_class load_balance methods
Take out the sched_class methods for load-balancing.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:09 +01:00
Peter Zijlstra
1e3c88bdeb sched: Move load balance code into sched_fair.c
Straight fwd code movement.

Since non of the load-balance abstractions are used anymore, do away with
them and simplify the code some. In preparation move the code around.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:40:08 +01:00
Yong Zhang
6d558c3ac9 sched: Reassign prev and switch_count when reacquire_kernel_lock() fail
Assume A->B schedule is processing, if B have acquired BKL before and it
need reschedule this time. Then on B's context, it will go to
need_resched_nonpreemptible for reschedule. But at this time, prev and
switch_count are related to A. It's wrong and will lead to incorrect
scheduler statistics.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <2674af741001102238w7b0ddcadref00d345e2181d11@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:39:04 +01:00
Mike Galbraith
50b926e439 sched: Fix vmark regression on big machines
SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't
enabled, leading to many cache misses on large machines as we traverse
looking for an idle shared cache to wake to.  Change the enabler of
select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the
sibling domain level.

Reported-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1262612696.15495.15.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-21 13:39:03 +01:00
Xiaotian Feng
ea9d8e3f45 clockevent: Don't remove broadcast device when cpu is dead
Marc reported that the BUG_ON in clockevents_notify() triggers on his
system. This happens because the kernel tries to remove an active
clock event device (used for broadcasting) from the device list.

The handling of devices which can be used as per cpu device and as a
global broadcast device is suboptimal.

The simplest solution for now (and for stable) is to check whether the
device is used as global broadcast device, but this needs to be
revisited.

[ tglx: restored the cpuweight check and massaged the changelog ]

Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
LKML-Reference: <1262834564-13033-1-git-send-email-dfeng@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
2010-01-18 14:44:50 +01:00
Milton Miller
e03bcb6862 generic-ipi: Optimize accesses by using DEFINE_PER_CPU_SHARED_ALIGNED for IPI data
The smp ipi data is passed around and given write access by
other cpus and should be separated from per-cpu data consumed by
this cpu.

Looking for hot lines, I saw call_function_data shared with
tick_cpu_sched.

Signed-off-by: Milton Miller <miltonm@bga.com>
Acked-by: Anton Blanchard <anton@samba.org>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: : Nick Piggin <npiggin@suse.de>
LKML-Reference: <20100118020051.GR12666@kryten>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-18 09:02:59 +01:00
Ingo Molnar
f426a7e029 Merge branch 'perf/scheduling' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core 2010-01-18 08:56:41 +01:00
James Morris
2457552d1e Merge branch 'master' into next 2010-01-18 09:56:22 +11:00
Frederic Weisbecker
329c0e012b perf: Better order flexible and pinned scheduling
When a task gets scheduled in. We don't touch the cpu bound events
so the priority order becomes:

	cpu pinned, cpu flexible, task pinned, task flexible.

So schedule out cpu flexibles when a new task context gets in
and correctly order the groups to schedule in:

	task pinned, cpu flexible, task flexible.

Cpu pinned groups don't need to be touched at this time.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
2010-01-17 13:11:05 +01:00
Frederic Weisbecker
7defb0f879 perf: Don't schedule out/in pinned events on task tick
We don't need to schedule in/out pinned events on task tick,
now that pinned and flexible groups can be scheduled separately.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
2010-01-17 13:09:51 +01:00
Frederic Weisbecker
5b0311e1f2 perf: Allow pinned and flexible groups to be scheduled separately
Tune the scheduling helpers so that we can choose to schedule either
pinned and/or flexible groups from a context.

And while at it, refactor a bit the naming of these helpers to make
these more consistent and flexible.

There is no (intended) change in scheduling behaviour in this
patch.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
2010-01-17 13:08:57 +01:00
Frederic Weisbecker
42cce92f4d perf: Make __perf_event_sched_out static
__perf_event_sched_out doesn't need to be globally available, make
it static.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
2010-01-17 13:08:01 +01:00
Masami Hiramatsu
231e36f4d2 tracing/kprobe: Update kprobe tracing self test for new syntax
Update kprobe tracing self test for new syntax (it supports
deleting individual probes, and drops $argN support)
and behavior change (new probes are disabled in default).

This selftest includes the following checks:

 - Adding function-entry probe and return probe with arguments.
 - Enabling these probes.
 - Deleting it individually.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100114051211.7814.29436.stgit@localhost6.localdomain6>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-17 08:15:35 +01:00
H Hartley Sweeten
6d686f4564 sched: Don't expose local functions
kernel/sched: don't expose local functions

The get_rr_interval_* functions are all class methods of
struct sched_class. They are not exported so make them
static.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <201001132021.53253.hartleys@visionengravers.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-17 08:09:45 +01:00
Frederic Weisbecker
24a53652e3 tracing: Drop the tr check from the graph tracing path
Each time we save a function entry from the function graph
tracer, we check if the trace array is set, which is wasteful
because it is set anyway before we start the tracer. All we need
is to ensure we have good read and write orderings. When we set
the trace array, we just need to guarantee it to be visible
before starting tracing.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
LKML-Reference: <1263453795-7496-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-17 08:06:25 +01:00
Linus Torvalds
2a8249daf6 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futexes: Remove rw parameter from get_futex_key()
2010-01-16 12:31:30 -08:00
Linus Torvalds
6ccc347b69 Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing/filters: Add comment for match callbacks
  tracing/filters: Fix MATCH_FULL filter matching for PTR_STRING
  tracing/filters: Fix MATCH_MIDDLE_ONLY filter matching
  lib: Introduce strnstr()
  tracing/filters: Fix MATCH_END_ONLY filter matching
  tracing/filters: Fix MATCH_FRONT_ONLY filter matching
  ftrace: Fix MATCH_END_ONLY function filter
  tracing/x86: Derive arch from bits argument in recordmcount.pl
  ring-buffer: Add rb_list_head() wrapper around new reader page next field
  ring-buffer: Wrap a list.next reference with rb_list_head()
2010-01-16 12:27:25 -08:00
David John
af2422c42c smp_call_function_any(): pass the node value to cpumask_of_node()
The change in acpi_cpufreq to use smp_call_function_any causes a warning
when it is called since the function erroneously passes the cpu id to
cpumask_of_node rather than the node that the cpu is on.  Fix this.

cpumask_of_node(3): node > nr_node_ids(1)
Pid: 1, comm: swapper Not tainted 2.6.33-rc3-00097-g2c1f189 #223
Call Trace:
 [<ffffffff81028bb3>] cpumask_of_node+0x23/0x58
 [<ffffffff81061f51>] smp_call_function_any+0x65/0xfa
 [<ffffffff810160d1>] ? do_drv_read+0x0/0x2f
 [<ffffffff81015fba>] get_cur_val+0xb0/0x102
 [<ffffffff81016080>] get_cur_freq_on_cpu+0x74/0xc5
 [<ffffffff810168a7>] acpi_cpufreq_cpu_init+0x417/0x515
 [<ffffffff81562ce9>] ? __down_write+0xb/0xd
 [<ffffffff8148055e>] cpufreq_add_dev+0x278/0x922

Signed-off-by: David John <davidjon@xenontk.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16 12:15:39 -08:00
Andi Kleen
5dab600e6a kfifo: document everywhere that size has to be power of two
On my first try using them I missed that the fifos need to be power of
two, resulting in a runtime bug.  Document that requirement everywhere
(and fix one grammar bug)

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Stefani Seibold <stefani@seibold.net>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Andy Walls <awalls@radix.net>
Cc: Vikram Dhillon <dhillonv10@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16 12:15:38 -08:00
Andi Kleen
a5b9e2c106 kfifo: add kfifo_out_peek
In some upcoming code it's useful to peek into a FIFO without permanentely
removing data.  This patch implements a new kfifo_out_peek() to do this.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Stefani Seibold <stefani@seibold.net>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Andy Walls <awalls@radix.net>
Cc: Vikram Dhillon <dhillonv10@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16 12:15:38 -08:00
Andi Kleen
64ce1037c5 kfifo: sanitize *_user error handling
Right now for kfifo_*_user it's not easily possible to distingush between
a user copy failing and the FIFO not containing enough data.  The problem
is that both conditions are multiplexed into the same return code.

Avoid this by moving the "copy length" into a separate output parameter
and only return 0/-EFAULT in the main return value.

I didn't fully adapt the weird "record" variants, those seem
to be unused anyways and were rather messy (should they be just removed?)

I would appreciate some double checking if I did all the conversions
correctly.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Andy Walls <awalls@radix.net>
Cc: Vikram Dhillon <dhillonv10@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16 12:15:38 -08:00
Andi Kleen
8ecc295153 kfifo: use void * pointers for user buffers
The pointers to user buffers are currently unsigned char *, which requires
a lot of casting in the caller for any non-char typed buffers.  Use void *
instead.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Stefani Seibold <stefani@seibold.net>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Andy Walls <awalls@radix.net>
Cc: Vikram Dhillon <dhillonv10@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16 12:15:38 -08:00
Frederic Weisbecker
d6f962b57b perf: Export software-only event group characteristic as a flag
Before scheduling an event group, we first check if a group can go
on. We first check if the group is made of software only events
first, in which case it is enough to know if the group can be
scheduled in.

For that purpose, we iterate through the whole group, which is
wasteful as we could do this check when we add/delete an event to
a group.

So we create a group_flags field in perf event that can host
characteristics from a group of events, starting with a first
PERF_GROUP_SOFTWARE flag that reduces the check on the fast path.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
2010-01-16 12:30:40 +01:00
Frederic Weisbecker
e286417378 perf: Round robin flexible groups of events using list_rotate_left()
This is more proper that doing it through a list_for_each_entry()
that breaks after the first entry.

v2: Don't rotate pinned groups as its not needed to time share
them.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
2010-01-16 12:30:28 +01:00
Frederic Weisbecker
889ff01506 perf/core: Split context's event group list into pinned and non-pinned lists
Split-up struct perf_event_context::group_list into pinned_groups
and flexible_groups (non-pinned).

This first appears to be useless as it duplicates various loops around
the group list handlings.

But it scales better in the fast-path in perf_sched_in(). We don't
anymore iterate twice through the entire list to separate pinned and
non-pinned scheduling. Instead we interate through two distinct lists.

The another desired effect is that it makes easier to define distinct
scheduling rules on both.

Changes in v2:
- Respectively rename pinned_grp_list and
  volatile_grp_list into pinned_groups and flexible_groups as per
  Ingo suggestion.
- Various cleanups

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
2010-01-16 12:27:42 +01:00
Paul E. McKenney
017c426138 rcu: Fix sparse warnings
Rename local variable "i" in rcu_init() to avoid conflict with
RCU_INIT_FLAVOR(), restrict the scope of RCU_TREE_NONCORE, and
make __synchronize_srcu() static.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12635142581560-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-16 10:25:22 +01:00
Li Zefan
d1303dd1d6 tracing/filters: Add comment for match callbacks
We should be clear on 2 things:

- the length parameter of a match callback includes
  tailing '\0'.

- the string to be searched might not be NULL-terminated.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8770.7000608@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-14 22:38:14 -05:00
Li Zefan
16da27a8bc tracing/filters: Fix MATCH_FULL filter matching for PTR_STRING
MATCH_FULL matching for PTR_STRING is not working correctly:

  # echo 'func == vt' > events/bkl/lock_kernel/filter
  # echo 1 > events/bkl/lock_kernel/enable
  ...
  # cat trace
   Xorg-1484  [000]  1973.392586: lock_kernel: ... func=vt_ioctl()
    gpm-1402  [001]  1974.027740: lock_kernel: ... func=vt_ioctl()

We should pass to regex.match(..., len) the length (including '\0')
of the source string instead of the length of the pattern string.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8763.5070707@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-14 22:38:12 -05:00
Li Zefan
b2af211f28 tracing/filters: Fix MATCH_MIDDLE_ONLY filter matching
The @str might not be NULL-terminated if it's of type
DYN_STRING or STATIC_STRING, so we should use strnstr()
instead of strstr().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8753.2000102@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-14 22:38:11 -05:00
Li Zefan
a3291c14ec tracing/filters: Fix MATCH_END_ONLY filter matching
For '*foo' pattern, we should allow any string ending with
'foo', but event filtering incorrectly disallows strings
like bar_foo_foo:

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8735.6070604@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-14 22:38:07 -05:00
Li Zefan
285caad415 tracing/filters: Fix MATCH_FRONT_ONLY filter matching
MATCH_FRONT_ONLY actually is a full matching:

  # ./perf record -R -f -a -e lock:lock_acquire \
	--filter 'name ~rcu_*' sleep 1
  # ./perf trace
  (no output)

We should pass the length of the pattern string to strncmp().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E8721.5090301@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-14 22:38:05 -05:00
Li Zefan
751e9983ee ftrace: Fix MATCH_END_ONLY function filter
For '*foo' pattern, we should allow any string ending with
'foo', but ftrace filter incorrectly disallows strings
like bar_foo_foo:

  # echo '*io' > set_ftrace_filter
  # cat set_ftrace_filter | grep 'req_bio_endio'
  # cat available_filter_functions | grep 'req_bio_endio'
  req_bio_endio

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4B4E870E.6060607@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-14 22:38:03 -05:00
Jamie Iles
8381f65d09 sched/perf: Make sure irqs are disabled for perf_event_task_sched_in()
perf_event_task_sched_in() expects interrupts to be disabled,
but on architectures with __ARCH_WANT_INTERRUPTS_ON_CTXSW
defined, this isn't true. If this is defined, disable irqs
around the call in finish_task_switch().

Signed-off-by: Jamie Iles <jamie.iles@picochip.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
LKML-Reference: <1262964453-27370-1-git-send-email-jamie.iles@picochip.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 10:43:08 +01:00
Masami Hiramatsu
14640106f2 tracing/kprobe: Drop function argument access syntax
Drop function argument access syntax, because the function
arguments depend on not only architecture but also
compile-options and function API. And now, we have perf-probe
for finding register/memory assigned to each argument.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: linuxppc-dev@ozlabs.org
LKML-Reference: <20100105224648.19431.52309.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 10:09:12 +01:00
Ingo Molnar
61405fea92 Merge branch 'perf/urgent' into perf/core
Merge reason: queue up dependent patch, update to -rc4

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 10:08:50 +01:00
KOSAKI Motohiro
7485d0d375 futexes: Remove rw parameter from get_futex_key()
Currently, futexes have two problem:

A) The current futex code doesn't handle private file mappings properly.

get_futex_key() uses PageAnon() to distinguish file and
anon, which can cause the following bad scenario:

  1) thread-A call futex(private-mapping, FUTEX_WAIT), it
     sleeps on file mapping object.
  2) thread-B writes a variable and it makes it cow.
  3) thread-B calls futex(private-mapping, FUTEX_WAKE), it
     wakes up blocked thread on the anonymous page. (but it's nothing)

B) Current futex code doesn't handle zero page properly.

Read mode get_user_pages() can return zero page, but current
futex code doesn't handle it at all. Then, zero page makes
infinite loop internally.

The solution is to use write mode get_user_page() always for
page lookup. It prevents the lookup of both file page of private
mappings and zero page.

Performance concerns:

Probaly very little, because glibc always initialize variables
for futex before to call futex(). It means glibc users never see
the overhead of this patch.

Compatibility concerns:

This patch has few compatibility issues. After this patch,
FUTEX_WAIT require writable access to futex variables (read-only
mappings makes EFAULT). But practically it's not a problem,
glibc always initalizes variables for futexes explicitly - nobody
uses read-only mappings.

Reported-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Cc: <stable@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Ulrich Drepper <drepper@gmail.com>
LKML-Reference: <20100105162633.45A2.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:17:36 +01:00
Paul E. McKenney
b6407e8639 rcu: Give different levels of the rcu_node hierarchy distinct lockdep names
Previously, each level of the rcu_node hierarchy had the same
rather unimaginative name: "&rcu_node_class[i]".  This makes
lockdep diagnostics involving these lockdep classes less helpful
than would be nice. This patch fixes this by giving each level
of the rcu_node hierarchy a distinct name: "rcu_node_level_0",
"rcu_node_level_1", and so on. This version of the patch
includes improved diagnostics suggested by Josh Triplett and
Peter Zijlstra.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626498421830-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:07 +01:00
Paul E. McKenney
cba8244a0f rcu: Add debug check for too many rcu_read_unlock()
TREE_PREEMPT_RCU maintains an rcu_read_lock_nesting counter in
the task structure, which happens to be a signed int.  So this
patch adds a check for this counter being negative at the end of
__rcu_read_unlock(). This check is under CONFIG_PROVE_LOCKING,
so can be thought of as being part of lockdep.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626498423064-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:06 +01:00
Paul E. McKenney
bf66f18e79 rcu: Add force_quiescent_state() testing to rcutorture
Add force_quiescent_state() testing to rcutorture, with a
separate thread that repeatedly invokes force_quiescent_state()
in bursts. This can greatly increase the probability of
encountering certain types of race conditions.

Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1262646551116-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:05 +01:00
Paul E. McKenney
46a1e34eda rcu: Make force_quiescent_state() start grace period if needed
Grace periods cannot be started while force_quiescent_state() is
active.  This is OK in that the affected CPUs will try again
later, but it does induce needless grace-period delays.  This
patch causes rcu_start_gp() to record a failed attempt to start
a grace period. When force_quiescent_state() prepares to return,
it then starts the grace period if there was such a failed
attempt.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626465501854-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:05 +01:00
Paul E. McKenney
45f014c52e rcu: Remove redundant grace-period check
The rcu_process_dyntick() function checks twice for the end of
the current grace period.  However, it holds the current
rcu_node structure's ->lock field throughout, and doesn't get to
the second call to rcu_gp_in_progress() unless there is at least
one CPU corresponding to this rcu_node structure that has not
yet checked in for the current grace period, which would prevent
the current grace period from ending. So the current grace
period cannot have ended, and the second check is redundant, so
remove it.

Also, given that this function is used even with !CONFIG_NO_HZ,
its name is quite misleading.  Change from rcu_process_dyntick()
to force_qs_rnp().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1262646550562-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:04 +01:00
Paul E. McKenney
ee47eb9f4d rcu: Remove leg of force_quiescent_state() switch statement
The comparisons of rsp->gpnum nad rsp->completed in
rcu_process_dyntick() and force_quiescent_state() can be
replaced by the much more clear rcu_gp_in_progress() predicate
function.  After doing this, it becomes clear that the
RCU_SAVE_COMPLETED leg of the force_quiescent_state() function's
switch statement is almost completely a no-op.  A small change
to the RCU_SAVE_DYNTICK leg renders it a complete no-op, after
which it can be removed.  Doing so also eliminates the forcenow
local variable from force_quiescent_state().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626465501781-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:04 +01:00
Paul E. McKenney
0f10dc8266 rcu: Eliminate rcu_process_dyntick() return value
Because a new grace period cannot start while we are executing
within the force_quiescent_state() function's switch statement,
if any test within that switch statement or within any function
called from that switch statement shows that the current grace
period has ended, we can safely re-do that test any time before
we leave the switch statement.  This means that we no longer
need a return value from rcu_process_dyntick(), as we can simply
invoke rcu_gp_in_progress() to check whether the old grace
period has finished -- there is no longer any need to worry
about whether or not a new grace period has been started.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626465501857-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:03 +01:00
Paul E. McKenney
eb1ba45f1e rcu: Eliminate second argument of rcu_process_dyntick()
At this point, the second argument to all calls to
rcu_process_dyntick() is a function of the same field of the
structure passed in as the first argument, namely, rsp->gpnum-1.
 So propagate rsp->gpnum-1 to all uses of the second argument
within rcu_process_dyntick() and then eliminate the second
argument.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626465503786-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:03 +01:00
Paul E. McKenney
39c0bbfc07 rcu: Eliminate local variable lastcomp from force_quiescent_state()
Because rsp->fqs_active is set to 1 across
force_quiescent_state()'s switch statement, rcu_start_gp() will
refrain from starting a new grace period during this time.
Therefore, rsp->gpnum is constant, and can be propagated to all
uses of lastcomp, eliminating this local variable.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626465502985-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:03 +01:00
Paul E. McKenney
f3a8b5c6aa rcu: Eliminate local variable signaled from force_quiescent_state()
Because the root rcu_node lock is held across entry to the
switch statement in force_quiescent_state(), it is no longer
necessary to snapshot rsp->signaled to a local variable.
Eliminate both the snapshotting and the local variable.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1262646550602-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:02 +01:00
Paul E. McKenney
07079d5357 rcu: Prohibit starting new grace periods while forcing quiescent states
Reduce the number and variety of race conditions by prohibiting
the start of a new grace period while force_quiescent_state() is
active. A new fqs_active flag in the rcu_state structure is used
to trace whether or not force_quiescent_state() is active, and
this new flag is tested by rcu_start_gp().  If the CPU that
closed out the last grace period needs another grace period,
this new grace period may be delayed up to one scheduling-clock
tick, but it will eventually get started.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <126264655052-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:02 +01:00
Paul E. McKenney
559569acf9 rcu: Adjust force_quiescent_state() locking, step 2
This patch releases rnp->lock after the end of
force_quiescent_state()'s switch statement.  This is a second
step towards prohibiting starting grace periods while
force_quiescent_state() is executing, which will reduce the
number and complexity of races that force_quiescent_state() is
involved in.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626465501994-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:01 +01:00
Paul E. McKenney
f96e9232e0 rcu: Adjust force_quiescent_state() locking, step 1
This causes rnp->lock to be held on entry to
force_quiescent_state()'s switch statement.  This is a first
step towards prohibiting starting grace periods while
force_quiescent_state() is executing, which will reduce the
number and complexity of races that force_quiescent_state() is
involved in.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12626465501455-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-01-13 09:06:01 +01:00
Andi Kleen
b45c6e76bc kernel/signal.c: fix kernel information leak with print-fatal-signals=1
When print-fatal-signals is enabled it's possible to dump any memory
reachable by the kernel to the log by simply jumping to that address from
user space.

Or crash the system if there's some hardware with read side effects.

The fatal signals handler will dump 16 bytes at the execution address,
which is fully controlled by ring 3.

In addition when something jumps to a unmapped address there will be up to
16 additional useless page faults, which might be potentially slow (and at
least is not very efficient)

Fortunately this option is off by default and only there on i386.

But fix it by checking for kernel addresses and also stopping when there's
a page fault.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11 09:34:05 -08:00
Dave Anderson
bd4f490a07 cgroups: fix 2.6.32 regression causing BUG_ON() in cgroup_diput()
The LTP cgroup test suite generates a "kernel BUG at kernel/cgroup.c:790!"
here in cgroup_diput():

                 /*
                  * if we're getting rid of the cgroup, refcount should ensure
                  * that there are no pidlists left.
                  */
                 BUG_ON(!list_empty(&cgrp->pidlists));

The cgroup pidlist rework in 2.6.32 generates the BUG_ON, which is caused
when pidlist_array_load() calls cgroup_pidlist_find():

(1) if a matching cgroup_pidlist is found, it down_write's the mutex of the
     pre-existing cgroup_pidlist, and increments its use_count.
(2) if no matching cgroup_pidlist is found, then a new one is allocated, it
     down_write's its mutex, and the use_count is set to 0.
(3) the matching, or new, cgroup_pidlist gets returned back to pidlist_array_load(),
     which increments its use_count -- regardless whether new or pre-existing --
     and up_write's the mutex.

So if a matching list is ever encountered by cgroup_pidlist_find() during
the life of a cgroup directory, it results in an inflated use_count value,
preventing it from ever getting released by cgroup_release_pid_array().
Then if the directory is subsequently removed, cgroup_diput() hits the
BUG_ON() when it finds that the directory's cgroup is still populated with
a pidlist.

The patch simply removes the use_count increment when a matching pidlist
is found by cgroup_pidlist_find(), because it gets bumped by the calling
pidlist_array_load() function while still protected by the list's mutex.

Signed-off-by: Dave Anderson <anderson@redhat.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: Paul Menage <menage@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11 09:34:05 -08:00
Masami Hiramatsu
8767ba2796 kmod: fix resource leak in call_usermodehelper_pipe()
Fix resource (write-pipe file) leak in call_usermodehelper_pipe().

When call_usermodehelper_exec() fails, write-pipe file is opened and
call_usermodehelper_pipe() just returns an error.  Since it is hard for
caller to determine whether the error occured when opening the pipe or
executing the helper, the caller cannot close the pipe by themselves.

I've found this resoruce leak when testing coredump.  You can check how
the resource leaks as below;

$ echo "|nocommand" > /proc/sys/kernel/core_pattern
$ ulimit -c unlimited
$ while [ 1 ]; do ./segv; done &> /dev/null &
$ cat /proc/meminfo (<- repeat it)

where segv.c is;
//-----
int main () {
        char *p = 0;
        *p = 1;
}
//-----

This patch closes write-pipe file if call_usermodehelper_exec() failed.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11 09:34:04 -08:00
Steven Rostedt
0e1ff5d72a ring-buffer: Add rb_list_head() wrapper around new reader page next field
If the very unlikely case happens where the writer moves the head by one
between where the head page is read and where the new reader page
is assigned _and_ the writer then writes and wraps the entire ring buffer
so that the head page is back to what was originally read as the head page,
the page to be swapped will have a corrupted next pointer.

Simple solution is to wrap the assignment of the next pointer with a
rb_list_head().

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-06 20:40:44 -05:00
David Sharp
5ded3dc6a3 ring-buffer: Wrap a list.next reference with rb_list_head()
This reference at the end of rb_get_reader_page() was causing off-by-one
writes to the prev pointer of the page after the reader page when that
page is the head page, and therefore the reader page has the RB_PAGE_HEAD
flag in its list.next pointer. This eventually results in a GPF in a
subsequent call to rb_set_head_page() (usually from rb_get_reader_page())
when that prev pointer is dereferenced. The dereferenced register would
characteristically have an address that appears shifted left by one byte
(eg, ffxxxxxxxxxxxxyy instead of ffffxxxxxxxxxxxx) due to being written at
an address one byte too high.

Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1262826727-9090-1-git-send-email-dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-06 20:38:25 -05:00
Steven Rostedt
d931369b74 tracing: Add stack dump to trace_printk if stacktrace option is set
If the ftrace stacktrace option is set, then add the stack dumps to
trace_printk.

Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-06 18:09:57 -05:00
Lai Jiangshan
7e53bd42d1 tracing: Consolidate protection of reader access to the ring buffer
At the beginning, access to the ring buffer was fully serialized
by trace_types_lock. Patch d7350c3f45 gives more freedom to readers,
and patch b04cc6b1f6 adds code to protect trace_pipe and cpu#/trace_pipe.

But actually it is not enough, ring buffer readers are not always
read-only, they may consume data.

This patch makes accesses to trace, trace_pipe, trace_pipe_raw
cpu#/trace, cpu#/trace_pipe and cpu#/trace_pipe_raw serialized.
And removes tracing_reader_cpumask which is used to protect trace_pipe.

Details:

Ring buffer serializes readers, but it is low level protection.
The validity of the events (which returns by ring_buffer_peek() ..etc)
are not protected by ring buffer.

The content of events may become garbage if we allow another process to consume
these events concurrently:
  A) the page of the consumed events may become a normal page
     (not reader page) in ring buffer, and this page will be rewritten
     by the events producer.
  B) The page of the consumed events may become a page for splice_read,
     and this page will be returned to system.

This patch adds trace_access_lock() and trace_access_unlock() primitives.

These primitives allow multi process access to different cpu ring buffers
concurrently.

These primitives don't distinguish read-only and read-consume access.
Multi read-only access is also serialized.

And we don't use these primitives when we open files,
we only use them when we read files.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B447D52.1050602@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-06 12:51:34 -05:00
Lai Jiangshan
0fa0edaf32 tracing: Remove show_format and related macros from TRACE_EVENT
The previous patches added the use of print_fmt string and changes
the trace_define_field() function to also create the fields and
format output for the event format files.

   text	   data	    bss	    dec	    hex	filename
5857201	1355780	9336808	16549789	 fc879d	vmlinux
5884589	1351684	9337896	16574169	 fce6d9	vmlinux-orig

The above shows the size of the vmlinux after this patch set
compared to the vmlinux-orig which is before the patch set.

This saves us 27k on text, 1k on bss and adds just 4k of data.

The total savings of 24k in size.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D4D.40604@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-06 12:08:46 -05:00
Lai Jiangshan
5a65e95622 tracing: Use defined fields and print_fmt to print formats
The calls ftrace_format_##call() and ftrace_define_fields_##call()
are almost duplicate in functionality. With the addition of the
print_fmt in previous patches, these two functions can be merged
into one.

The trace_define_field() defines the fields and links them into
the struct ftrace_event_call. The previous patches introduced
the print_fmt field and this can now be used with the trace_define_field()
to create the event format file fields and print_fmt field.

The struct ftrace_event_call->fields are used to print the fields
The struct ftrace_event_call->print_fmt is used to print
the "print fmt: XXXXXXXXXXX" line.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D49.5000006@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-06 12:08:20 -05:00
Steven Rostedt
c7ef3a9004 tracing: Have syscall tracing call its own init function
In the clean up of having all events call one specific function,
the syscall event init was changed to call this helper function.

With the new print_fmt updates, the syscalls need to do special
initializations. This patch converts the syscall events to call
its own init function again.

Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-06 12:02:32 -05:00
Lai Jiangshan
a342a0280b tracing/kprobes: Init print_fmt for kprobe events
This is part of a patch set that removes the show_format method
in the ftrace event macros.

Add the print_fmt initialization to the kprobe events.
The print_fmt is still not used, but will be in the follow up
patches.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D45.3080100@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-06 12:01:35 -05:00
Lai Jiangshan
50307a45f8 tracing/syscalls: Init print_fmt for syscall events
This is part of a patch set that removes the show_format method
in the ftrace event macros.

Add the print_fmt initialization to the syscall events.
The print_fmt is still not used, but will be in the follow up
patches.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D41.609@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-06 11:58:32 -05:00
Lai Jiangshan
509e760cd9 tracing: Add print_fmt field
This is part of a patch set that removes the show_format method
in the ftrace event macros.

The print_fmt field is added to hold the string that shows
the print_fmt in the event format files. This patch only adds
the field but it is currently not used. Later patches will use
this field to enable us to remove the show_format field
and function.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D3E.2000704@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-06 11:41:54 -05:00
Lai Jiangshan
809826a389 tracing: Have __dynamic_array() define a field
This is part of a patch set that removes the show_format method
in the ftrace event macros.

This patch set requires that all fields are added to the
ftrace_event_call->fields. This patch changes __dynamic_array()
to call trace_define_field() to include fields that use __dynamic_array().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D36.8090100@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-01-06 11:30:02 -05:00
Ben Hutchings
10b465aaf9 modules: Skip empty sections when exporting section notes
Commit 35dead4 "modules: don't export section names of empty sections
via sysfs" changed the set of sections that have attributes, but did
not change the iteration over these attributes in add_notes_attrs().
This can lead to add_notes_attrs() creating attributes with the wrong
names or with null name pointers.

Introduce a sect_empty() function and use it in both add_sect_attrs()
and add_notes_attrs().

Reported-by: Martin Michlmayr <tbm@cyrius.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Tested-by: Martin Michlmayr <tbm@cyrius.com>
Cc: stable@kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-06 01:11:29 -08:00
Steffen Klassert
16295bec63 padata: Generic parallelization/serialization interface
This patch introduces an interface to process data objects
in parallel. The parallelized objects return after serialization
in the same order as they were before the parallelization.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2010-01-06 19:47:10 +11:00
Christoph Lameter
79615760f3 local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c
ringbuffer*.c are the last users of local.h.

Remove the include from modules.h and add it to ringbuffer files.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-01-05 15:34:50 +09:00
Christoph Lameter
e1783a240f module: Use this_cpu_xx to dynamically allocate counters
Use cpu ops to deal with the per cpu data instead of a local_t. Reduces memory
requirements, cache footprint and decreases cycle counts.

The this_cpu_xx operations are also used for !SMP mode. Otherwise we could
not drop the use of __module_ref_addr() which would make per cpu data handling
complicated. this_cpu_xx operations have their own fallback for !SMP.

V8-V9:
- Leave include asm/module.h since ringbuffer.c depends on it. Nothing else
  does though. Another patch will deal with that.
- Remove spurious free.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-01-05 15:34:50 +09:00
Tejun Heo
32032df6c2 Merge branch 'master' into percpu
Conflicts:
	arch/powerpc/platforms/pseries/hvCall.S
	include/linux/percpu.h
2010-01-05 09:17:33 +09:00
Linus Torvalds
952363c90c Merge branch 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf: Fix NULL deref in inheritance code
  perf: Pass appropriate frame pointer to dump_trace()
2009-12-31 11:56:24 -08:00
Linus Torvalds
9d6e323c68 Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf kmem: Fix statistics typo
  kprobes: Fix distinct type warning
  perf: Rename perf_event_hw_event in design document
  perf tools: Add missing header files to LIB_H Makefile variable
  perf record: We should fork only if a program was specified to run
  perf diff: Fix usage array, it must end with a NULL entry
2009-12-31 11:52:24 -08:00
Linus Torvalds
b21c070403 Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing: Fix sign fields in ftrace_define_fields_##call()
  tracing/syscalls: Fix typo in SYSCALL_DEFINE0
  tracing/kprobe: Show sign of fields in trace_kprobe format files
  ksym_tracer: Remove trace_stat
  ksym_tracer: Fix race when incrementing count
  ksym_tracer: Fix to allow writing newline to ksym_trace_filter
  ksym_tracer: Fix to make the tracer work
  tracing: Kconfig spelling fixes and cleanups
  tracing: Fix setting tracer specific options
  Documentation: Update ftrace-design.txt
  Documentation: Update tracepoint-analysis.txt
  Documentation: Update mmiotrace.txt
2009-12-31 11:52:01 -08:00
KOSAKI Motohiro
0f4bd46ec2 kmsg_dump: Dump on crash_kexec as well
crash_kexec gets called before kmsg_dump(KMSG_DUMP_OOPS) if
panic_on_oops is set, so the kernel log buffer is not stored
for this case.

This patch adds a KMSG_DUMP_KEXEC dump type which gets called
when crash_kexec() is invoked. To avoid getting double dumps,
the old KMSG_DUMP_PANIC is moved below crash_kexec(). The
mtdoops driver is modified to handle KMSG_DUMP_KEXEC in the
same way as a panic.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Simon Kagstrom <simon.kagstrom@netinsight.net>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-12-31 19:45:04 +00:00
Peter Zijlstra
05cbaa2853 perf: Fix NULL deref in inheritance code
Liming found a NULL deref when a task has a perf context but no
counters  when it forks.

This can occur in two cases, a race during construction where
the fork hits after installing the context but before the first
counter gets inserted, or more reproducably, a fork after the
last counter is closed (which leaves the context around).

Reported-by: Wang Liming <liming.wang@windriver.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
CC: <stable@kernel.org>
LKML-Reference: <1262185684.7135.222.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-31 13:11:31 +01:00
Lai Jiangshan
fb7ae981cb tracing: Fix sign fields in ftrace_define_fields_##call()
Add is_signed_type() call to trace_define_field() in ftrace macros.

The code previously just passed in 0 (false), disregarding whether
or not the field was actually a signed type.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D3A.6020007@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-30 10:27:06 -05:00
Lai Jiangshan
79b4082108 tracing/kprobe: Show sign of fields in trace_kprobe format files
The format files of trace_kprobe do not show the sign of the fields.
The other format files show the field signed type of the fields and
this patch makes the trace_kprobe formats consistent with the others.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B273D27.5040009@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-30 10:27:03 -05:00
Li Zefan
53ab668064 ksym_tracer: Remove trace_stat
trace_stat is problematic. Don't use it, use seqfile instead.

This fixes a race that reading the stat file is not protected by
any lock, which can lead to use after free.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B3AF203.40200@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-30 07:50:50 +01:00
Li Zefan
e6d9491bf8 ksym_tracer: Fix race when incrementing count
We are under rcu read section but not holding the write lock, so
count++ is not atomic. Use atomic64_t instead.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B3AF1EC.9010608@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-30 07:50:49 +01:00
Li Zefan
3d13ec2efd ksym_tracer: Fix to allow writing newline to ksym_trace_filter
It used to work, but now doesn't:

 # echo > ksym_filter
 bash: echo: write error: Invalid argument

It's caused by d954fbf0ff
("tracing: Fix wrong usage of strstrip in trace_ksyms").

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B3AF1D7.5040400@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-30 07:50:49 +01:00
Li Zefan
88f7a890d7 ksym_tracer: Fix to make the tracer work
ksym tracer doesn't work:

 # echo tasklist_lock:rw- > ksym_trace_filter
 -bash: echo: write error: No such device

It's because we pass to perf_event_create_kernel_counter()
a cpu number which is not present.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B3AF19E.1010201@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-30 07:50:47 +01:00
Simon Kagstrom
d894837f23 sched: might_sleep(): Make file parameter const char *
Fixes a warning when building with g++:

 warning: deprecated conversion from string constant to 'char*'

And the file parameter use is constant, so mark it as such.

Signed-off-by: Simon Kagstrom <simon.kagstrom@netinsight.net>
Cc: peterz@infradead.org
LKML-Reference: <20091223110818.442d848e@marrow.netinsight.se>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-28 10:50:13 +01:00
Randy Dunlap
40892367bc tracing: Kconfig spelling fixes and cleanups
Fix filename reference (ftrace-implementation.txt ->
ftrace-design.txt).

Fix spelling, punctuation, grammar.

Fix help text indentation and line lengths to reduce need for
horizontal scrolling or larger window sizes.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091221120117.3fb49cdc.randy.dunlap@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-28 10:37:54 +01:00
Li Zefan
07b139c8c8 perf events: Remove CONFIG_EVENT_PROFILE
Quoted from Ingo:

| This reminds me - i think we should eliminate CONFIG_EVENT_PROFILE -
| it's an unnecessary Kconfig complication. If both PERF_EVENTS and
| EVENT_TRACING is enabled we should expose generic tracepoints.
|
| Nor is it limited to event 'profiling', so it has become a misnomer as
| well.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4B2F1557.2050705@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-28 10:33:06 +01:00
Heiko Carstens
c2ef6661ce kprobes: Fix distinct type warning
Every time I see this:

 kernel/kprobes.c: In function 'register_kretprobe':
 kernel/kprobes.c:1038: warning: comparison of distinct pointer types lacks a cast

I'm wondering if something changed in common code and we need to
do something for s390. Apparently that's not the case.
Let's get rid of this annoying warning.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
LKML-Reference: <20091221120224.GA4471@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-28 10:25:31 +01:00
Peter Zijlstra
49f474331e perf events: Remove arg from perf sched hooks
Since we only ever schedule the local cpu, there is no need to pass the
cpu number to the perf sched hooks.

This micro-optimizes things a bit.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-28 09:21:33 +01:00
Linus Torvalds
0b5e2588d8 Merge branch 'sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc-2.6
* 'sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc-2.6:
  SYSCTL: Add a mutex to the page_alloc zone order sysctl
  SYSCTL: Print binary sysctl warnings (nearly) only once
2009-12-24 13:01:29 -08:00
Andi Kleen
4440095c82 SYSCTL: Print binary sysctl warnings (nearly) only once
When printing legacy sysctls print the warning message
for each of them only once.  This way there is a guarantee
the syslog won't be flooded for any sane program.

The original attempt at this made the tables non const and stored
the flag inline.

Linus suggested using a separate hash table for this, this is based on a
code snippet from him.

The hash implies this is not exact and can sometimes not print a
new sysctl due to a hash collision, but in practice this should not
be a problem

I used a FNV32 hash over the binary string with a 32byte bitmap. This
gives relatively little collisions when all the predefined binary sysctls
are hashed:

size 256
bucket
length      number
0:          [25]
1:          [67]
2:          [88]
3:          [47]
4:          [22]
5:          [6]
6:          [1]

The worst case is a single collision of 6 hash values.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-12-23 21:00:20 +01:00
Linus Torvalds
6432ed648a Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Revert 738d2be, simplify set_task_cpu()
2009-12-23 09:12:57 -08:00
Peter Zijlstra
0c69774e6c sched: Revert 738d2be, simplify set_task_cpu()
Effectively reverts 738d2be430.

As demonstrated by Eric, we really need to call __set_task_cpu()
early in the fork() path to properly initialize the various task
state -- specifically the cgroup state through set_task_rq().

[ we could probably fix this by explicitly calling
  __set_task_cpu() from   sched_fork(), but lets try that for the
  next cycle and simply revert to the old behaviour for now. ]

Reported-by: Eric Paris <eparis@redhat.com>
Tested-by: Eric Paris <eparis@redhat.com>,
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: efault@gmx.de
LKML-Reference: <1261492999.4937.36.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-23 10:04:10 +01:00
Linus Torvalds
fe35d4a028 Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  jfs: Fix 32bit build warning
  Remove obsolete comment in fs.h
  Sanitize f_flags helpers
  Fix f_flags/f_mode in case of lookup_instantiate_filp() from open(pathname, 3)
  anonfd: Allow making anon files read-only
  fs/compat_ioctl.c: fix build error when !BLOCK
  pohmelfs needs I_LOCK
  alloc_file(): simplify handling of mnt_clone_write() errors
2009-12-22 14:20:48 -08:00
Stefani Seibold
86d4880313 kfifo: add record handling functions
Add kfifo_in_rec() - puts some record data into the FIFO
 Add kfifo_out_rec() - gets some record data from the FIFO
 Add kfifo_from_user_rec() - puts some data from user space into the FIFO
 Add kfifo_to_user_rec() - gets data from the FIFO and write it to user space
 Add kfifo_peek_rec() - gets the size of the next FIFO record field
 Add kfifo_skip_rec() - skip the next fifo out record
 Add kfifo_avail_rec() - determinate the number of bytes available in a record FIFO

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-22 14:17:56 -08:00
Stefani Seibold
a121f24acc kfifo: add kfifo_skip, kfifo_from_user and kfifo_to_user
Add kfifo_reset_out() for save lockless discard the fifo output
 Add kfifo_skip() to skip a number of output bytes
 Add kfifo_from_user() to copy user space data into the fifo
 Add kfifo_to_user() to copy fifo data to user space

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-22 14:17:56 -08:00
Stefani Seibold
7acd72eb85 kfifo: rename kfifo_put... into kfifo_in... and kfifo_get... into kfifo_out...
rename kfifo_put...  into kfifo_in...  to prevent miss use of old non in
kernel-tree drivers

ditto for kfifo_get...  -> kfifo_out...

Improve the prototypes of kfifo_in and kfifo_out to make the kerneldoc
annotations more readable.

Add mini "howto porting to the new API" in kfifo.h

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-22 14:17:56 -08:00
Stefani Seibold
e64c026dd0 kfifo: cleanup namespace
change name of __kfifo_* functions to kfifo_*, because the prefix __kfifo
should be reserved for internal functions only.

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-22 14:17:56 -08:00
Stefani Seibold
c1e13f2567 kfifo: move out spinlock
Move the pointer to the spinlock out of struct kfifo.  Most users in
tree do not actually use a spinlock, so the few exceptions now have to
call kfifo_{get,put}_locked, which takes an extra argument to a
spinlock.

Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-22 14:17:56 -08:00
Stefani Seibold
4546548789 kfifo: move struct kfifo in place
This is a new generic kernel FIFO implementation.

The current kernel fifo API is not very widely used, because it has to
many constrains.  Only 17 files in the current 2.6.31-rc5 used it.
FIFO's are like list's a very basic thing and a kfifo API which handles
the most use case would save a lot of development time and memory
resources.

I think this are the reasons why kfifo is not in use:

 - The API is to simple, important functions are missing
 - A fifo can be only allocated dynamically
 - There is a requirement of a spinlock whether you need it or not
 - There is no support for data records inside a fifo

So I decided to extend the kfifo in a more generic way without blowing up
the API to much.  The new API has the following benefits:

 - Generic usage: For kernel internal use and/or device driver.
 - Provide an API for the most use case.
 - Slim API: The whole API provides 25 functions.
 - Linux style habit.
 - DECLARE_KFIFO, DEFINE_KFIFO and INIT_KFIFO Macros
 - Direct copy_to_user from the fifo and copy_from_user into the fifo.
 - The kfifo itself is an in place member of the using data structure, this save an
   indirection access and does not waste the kernel allocator.
 - Lockless access: if only one reader and one writer is active on the fifo,
   which is the common use case, no additional locking is necessary.
 - Remove spinlock - give the user the freedom of choice what kind of locking to use if
   one is required.
 - Ability to handle records. Three type of records are supported:
   - Variable length records between 0-255 bytes, with a record size
     field of 1 bytes.
   - Variable length records between 0-65535 bytes, with a record size
     field of 2 bytes.
   - Fixed size records, which no record size field.
 - Preserve memory resource.
 - Performance!
 - Easy to use!

This patch:

Since most users want to have the kfifo as part of another object,
reorganize the code to allow including struct kfifo in another data
structure.  This requires changing the kfifo_alloc and kfifo_init
prototypes so that we pass an existing kfifo pointer into them.  This
patch changes the implementation and all existing users.

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-22 14:17:55 -08:00
Linus Torvalds
83f57a11d8 Revert "time: Remove xtime_cache"
This reverts commit 7bc7d63745, as
requested by John Stultz. Quoting John:

 "Petr Titěra reported an issue where he saw odd atime regressions with
  2.6.33 where there were a full second worth of nanoseconds in the
  nanoseconds field.

  He also reviewed the time code and narrowed down the problem: unhandled
  overflow of the nanosecond field caused by rounding up the
  sub-nanosecond accumulated time.

  Details:

   * At the end of update_wall_time(), we currently round up the
  sub-nanosecond portion of accumulated time when storing it into xtime.
  This was added to avoid time inconsistencies caused when the
  sub-nanosecond portion was truncated when storing into xtime.
  Unfortunately we don't handle the possible second overflow caused by
  that rounding.

   * Previously the xtime_cache code hid this overflow by normalizing the
  xtime value when storing into the xtime_cache.

   * We could try to handle the second overflow after the rounding up, but
  since this affects the timekeeping's internal state, this would further
  complicate the next accumulation cycle, causing small errors in ntp
  steering. As much as I'd like to get rid of it, the xtime_cache code is
  known to work.

   * The correct fix is really to include the sub-nanosecond portion in the
  timekeeping accessor function, so we don't need to round up at during
  accumulation. This would greatly simplify the accumulation code.
  Unfortunately, we can't do this safely until the last three
  non-GENERIC_TIME arches (sparc32, arm, cris) are converted  (those
  patches are in -mm) and we kill off the spots where arches set xtime
  directly. This is all 2.6.34 material, so I think reverting the
  xtime_cache change is the best approach for now.

  Many thanks to Petr for both reporting and finding the issue!"

Reported-by: Petr Titěra <P.Titera@century.cz>
Requested-by: john stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-22 14:10:37 -08:00
Al Viro
5300990c03 Sanitize f_flags helpers
* pull ACC_MODE to fs.h; we have several copies all over the place
* nightmarish expression calculating f_mode by f_flags deserves a helper
too (OPEN_FMODE(flags))

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-22 12:27:34 -05:00
Roland Dreier
628ff7c1d8 anonfd: Allow making anon files read-only
It seems a couple places such as arch/ia64/kernel/perfmon.c and
drivers/infiniband/core/uverbs_main.c could use anon_inode_getfile()
instead of a private pseudo-fs + alloc_file(), if only there were a way
to get a read-only file.  So provide this by having anon_inode_getfile()
create a read-only file if we pass O_RDONLY in flags.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-22 12:27:34 -05:00
Steven Rostedt
c757bea93b tracing: Fix setting tracer specific options
The function __set_tracer_option() takes as its last parameter a
"neg" value. If set it should negate the value of the option.

The trace_options_write() passed the value written to the file
which is what the new value needs to be set as. But since this
is not the negative, it never sets the value.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-21 22:35:16 -05:00
Dominik Brodowski
0e2c8b8f55 resources: fix call to alignf() in allocate_resource()
The second parameter to alignf() in allocate_resource() must
reflect what new resource is attempted to be allocated, else
functions like pcibios_align_resource() (at least on x86) or
pcmcia_align() can't work correctly.

Commit 1e5ad96790 broke this by
setting the "new" resource until we're about to return success.
To keep the resource untouched when allocate_resource() fails,
a "tmp" resource is introduced.

Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Acked-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-21 10:42:29 -08:00
Peter Zijlstra
70f1120527 sched: Fix hotplug hang
The hot-unplug kstopmachine usage does a wakeup after
deactivating the cpu, hence we cannot use cpu_active()
here but must rely on the good olde online.

Reported-by: Sachin Sant <sachinp@in.ibm.com>
Reported-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LKML-Reference: <1261326987.4314.24.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-20 23:31:23 +01:00
Peter Zijlstra
3df0fc5b2e sched: Restore printk sanity
Revert the braindead pr_* crap. (Commit 663997d "sched: Use
pr_fmt() and pr_<level>()")

It's dumb and causes stupid "sched: " strings all over the place.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Mike Galbraith <efault@gmx.de>
Cc: Joe Perches <joe@perches.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <1261315437.4314.6.camel@laptop>
[ i dont mind the pr_*() patterns that much - but Peter dislikes them with a vengence. ]
[ - v2: remove spurious diffstat from changelog :-/ ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-20 19:05:02 +01:00
Linus Torvalds
eca9dfcd00 Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf session: Make events_stats u64 to avoid overflow on 32-bit arches
  hw-breakpoints: Fix hardware breakpoints -> perf events dependency
  perf events: Dont report side-band events on each cpu for per-task-per-cpu events
  perf events, x86/stacktrace: Fix performance/softlockup by providing a special frame pointer-only stack walker
  perf events, x86/stacktrace: Make stack walking optional
  perf events: Remove unused perf_counter.h header file
  perf probe: Check new event name
  kprobe-tracer: Check new event/group name
  perf probe: Check whether debugfs path is correct
  perf probe: Fix libdwarf include path for Debian
2009-12-19 09:48:42 -08:00
Linus Torvalds
aac3d39693 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
  sched: Fix broken assertion
  sched: Assert task state bits at build time
  sched: Update task_state_arraypwith new states
  sched: Add missing state chars to TASK_STATE_TO_CHAR_STR
  sched: Move TASK_STATE_TO_CHAR_STR near the TASK_state bits
  sched: Teach might_sleep() about preemptible RCU
  sched: Make warning less noisy
  sched: Simplify set_task_cpu()
  sched: Remove the cfs_rq dependency from set_task_cpu()
  sched: Add pre and post wakeup hooks
  sched: Move kthread_bind() back to kthread.c
  sched: Fix select_task_rq() vs hotplug issues
  sched: Fix sched_exec() balancing
  sched: Ensure set_task_cpu() is never called on blocked tasks
  sched: Use TASK_WAKING for fork wakups
  sched: Select_task_rq_fair() must honour SD_LOAD_BALANCE
  sched: Fix task_hot() test order
  sched: Fix set_cpu_active() in cpu_down()
  sched: Mark boot-cpu active before smp_init()
  sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
  ...
2009-12-19 09:47:49 -08:00
Linus Torvalds
10e5453ffa Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sys: Fix missing rcu protection for __task_cred() access
  signals: Fix more rcu assumptions
  signal: Fix racy access to __task_cred in kill_pid_info_as_uid()
2009-12-19 09:47:34 -08:00
Linus Torvalds
3cd312c3e8 Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  timers: Remove duplicate setting of new_base in __mod_timer()
  clockevents: Prevent clockevent_devices list corruption on cpu hotplug
2009-12-19 09:47:18 -08:00
Al Viro
b4c30aad39 fix more leaks in audit_tree.c tag_chunk()
Several leaks in audit_tree didn't get caught by commit
318b6d3d7d, including the leak on normal
exit in case of multiple rules refering to the same chunk.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-19 09:27:43 -08:00
Al Viro
6f5d511489 fix braindamage in audit_tree.c untag_chunk()
... aka "Al had badly fscked up when writing that thing and nobody
noticed until Eric had fixed leaks that used to mask the breakage".

The function essentially creates a copy of old array sans one element
and replaces the references to elements of original (they are on cyclic
lists) with those to corresponding elements of new one.  After that the
old one is fair game for freeing.

First of all, there's a dumb braino: when we get to list_replace_init we
use indices for wrong arrays - position in new one with the old array
and vice versa.

Another bug is more subtle - termination condition is wrong if the
element to be excluded happens to be the last one.  We shouldn't go
until we fill the new array, we should go until we'd finished the old
one.  Otherwise the element we are trying to kill will remain on the
cyclic lists...

That crap used to be masked by several leaks, so it was not quite
trivial to hit.  Eric had fixed some of those leaks a while ago and the
shit had hit the fan...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-19 09:27:43 -08:00
Linus Torvalds
55db493b65 Merge branch 'cpumask-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
* 'cpumask-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  cpumask: rename tsk_cpumask to tsk_cpus_allowed
  cpumask: don't recommend set_cpus_allowed hack in Documentation/cpu-hotplug.txt
  cpumask: avoid dereferencing struct cpumask
  cpumask: convert drivers/idle/i7300_idle.c to cpumask_var_t
  cpumask: use modern cpumask style in drivers/scsi/fcoe/fcoe.c
  cpumask: avoid deprecated function in mm/slab.c
  cpumask: use cpu_online in kernel/perf_event.c
2009-12-17 17:00:20 -08:00
Linus Torvalds
efc8e7f4c8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
  Keys: KEYCTL_SESSION_TO_PARENT needs TIF_NOTIFY_RESUME architecture support
  NOMMU: Optimise away the {dac_,}mmap_min_addr tests
  security/min_addr.c: make init_mmap_min_addr() static
  keys: PTR_ERR return of wrong pointer in keyctl_get_security()
2009-12-17 16:58:26 -08:00
Linus Torvalds
dcc7cd0112 Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6
* 'kmemleak' of git://linux-arm.org/linux-2.6:
  kmemleak: fix kconfig for crc32 build error
  kmemleak: Reduce the false positives by checking for modified objects
  kmemleak: Show the age of an unreferenced object
  kmemleak: Release the object lock before calling put_object()
  kmemleak: Scan the _ftrace_events section in modules
  kmemleak: Simplify the kmemleak_scan_area() function prototype
  kmemleak: Do not use off-slab management with SLAB_NOLEAKTRACE
2009-12-17 16:00:19 -08:00
Randy Dunlap
6485536bcf printk: fix new kernel-doc warnings
Fix kernel-doc warnings in printk.c:

Warning(kernel/printk.c:1422): No description found for parameter 'dumper'
Warning(kernel/printk.c:1422): Excess function parameter 'dump' description in 'kmsg_dump_register'
Warning(kernel/printk.c:1451): No description found for parameter 'dumper'
Warning(kernel/printk.c:1451): Excess function parameter 'dump' description in 'kmsg_dump_unregister'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-17 15:45:32 -08:00
Oleg Nesterov
9cd80bbb07 do_wait() optimization: do not place sub-threads on task_struct->children list
Thanks to Roland who pointed out de_thread() issues.

Currently we add sub-threads to ->real_parent->children list.  This buys
nothing but slows down do_wait().

With this patch ->children contains only main threads (group leaders).
The only complication is that forget_original_parent() should iterate over
sub-threads by hand, and de_thread() needs another list_replace() when it
changes ->group_leader.

Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
tasks, we can remove this check (and we can unify do_wait_thread() and
ptrace_do_wait()).

This change can confuse the optimistic search in mm_update_next_owner(),
but this is fixable and minor.

Perhaps badness() and oom_kill_process() should be updated, but they
should be fixed in any case.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-17 15:45:31 -08:00
WANG Cong
3e26120cc7 kernel/sysctl.c: fix the incomplete part of sysctl_max_map_count-should-be-non-negative.patch
It is a mistake that we used 'proc_dointvec', it should be
'proc_dointvec_minmax', as in the original patch.

Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-17 15:45:30 -08:00
Linus Torvalds
5a865c0606 Merge branch 'for-33' of git://repo.or.cz/linux-kbuild
* 'for-33' of git://repo.or.cz/linux-kbuild: (29 commits)
  net: fix for utsrelease.h moving to generated
  gen_init_cpio: fixed fwrite warning
  kbuild: fix make clean after mismerge
  kbuild: generate modules.builtin
  genksyms: properly consider  EXPORT_UNUSED_SYMBOL{,_GPL}()
  score: add asm/asm-offsets.h wrapper
  unifdef: update to upstream revision 1.190
  kbuild: specify absolute paths for cscope
  kbuild: create include/generated in silentoldconfig
  scripts/package: deb-pkg: use fakeroot if available
  scripts/package: add KBUILD_PKG_ROOTCMD variable
  scripts/package: tar-pkg: use tar --owner=root
  Kbuild: clean up marker
  net: add net_tstamp.h to headers_install
  kbuild: move utsrelease.h to include/generated
  kbuild: move autoconf.h to include/generated
  drop explicit include of autoconf.h
  kbuild: move compile.h to include/generated
  kbuild: drop include/asm
  kbuild: do not check for include/asm-$ARCH
  ...

Fixed non-conflicting clean merge of modpost.c as per comments from
Stephen Rothwell (modpost.c had grown an include of linux/autoconf.h
that needed to be changed to generated/autoconf.h)
2009-12-17 07:23:42 -08:00
Peter Zijlstra
077614ee1e sched: Fix broken assertion
There's a preemption race in the set_task_cpu() debug check in
that when we get preempted after setting task->state we'd still
be on the rq proper, but fail the test.

Check for preempted tasks, since those are always on the RQ.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20091217121830.137155561@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-17 13:22:46 +01:00
Peter Zijlstra
5d27c23df0 perf events: Dont report side-band events on each cpu for per-task-per-cpu events
Acme noticed that his FORK/MMAP numbers were inflated by about
the same factor as his cpu-count.

This led to the discovery of a few more sites that need to
respect the event->cpu filter.

Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091217121830.215333434@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-17 13:21:36 +01:00
Frederic Weisbecker
61c1917f47 perf events, x86/stacktrace: Make stack walking optional
The current print_context_stack helper that does the stack
walking job is good for usual stacktraces as it walks through
all the stack and reports even addresses that look unreliable,
which is nice when we don't have frame pointers for example.

But we have users like perf that only require reliable
stacktraces, and those may want a more adapted stack walker, so
lets make this function a callback in stacktrace_ops that users
can tune for their needs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1261024834-5336-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-17 09:56:19 +01:00
Frederic Weisbecker
234da7bcdc sched: Teach might_sleep() about preemptible RCU
In practice, it is harmless to voluntarily sleep in a
rcu_read_lock() section if we are running under preempt rcu, but
it is illegal if we build a kernel running non-preemptable rcu.

Currently, might_sleep() doesn't notice sleepable operations
under rcu_read_lock() sections if we are running under
preemptable rcu because preempt_count() is left untouched after
rcu_read_lock() in this case. But we want developers who test
their changes under such config to notice the "sleeping while
atomic" issues.

So we add rcu_read_lock_nesting to prempt_count() in
might_sleep() checks.

[ v2: Handle rcu-tiny ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1260991265-8451-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-17 09:46:44 +01:00
Masami Hiramatsu
6f3cf44047 kprobe-tracer: Check new event/group name
Check new event/group name is same syntax as a C symbol. In other
words, checking the name is as like as other tracepoint events.

This can prevent user to create an event with useless name (e.g.
foo|bar, foo*bar).

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
LKML-Reference: <20091216222408.14459.68790.stgit@dhcp-100-2-132.bos.redhat.com>
[ v2: minor cleanups ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-17 09:42:44 +01:00
Ingo Molnar
416eb39556 sched: Make warning less noisy
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170517.807938893@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-17 06:05:49 +01:00
Rusty Russell
62ac127950 cpumask: avoid dereferencing struct cpumask
struct cpumask will be undefined soon with CONFIG_CPUMASK_OFFSTACK=y,
to avoid them being declared on the stack.

cpumask_bits() does what we want here (of course, this code is crap).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
To: Thomas Gleixner <tglx@linutronix.de>
2009-12-17 11:43:29 +10:30
Rusty Russell
f6325e30eb cpumask: use cpu_online in kernel/perf_event.c
Also, we want to check against nr_cpu_ids, not num_possible_cpus().
The latter works, but the correct bounds check is < nr_cpu_ids.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
To: Thomas Gleixner <tglx@linutronix.de>
2009-12-17 11:43:11 +10:30
Simon Horman
cf1e367ee8 timers: Remove duplicate setting of new_base in __mod_timer()
new_base is set using per_cpu(tvec_bases, cpu) after selecting the
desired value of cpu immediately below so this line is a unnecessary.

Signed-off-by: Simon Horman <horms@verge.net.au>
LKML-Reference: <20091217001542.GD25317@verge.net.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-12-17 01:30:49 +01:00
David Howells
6e14154676 NOMMU: Optimise away the {dac_,}mmap_min_addr tests
In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be
skipped by the compiler.  We do this as the minimum mmap address doesn't make
any sense in NOMMU mode.

mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-12-17 09:25:19 +11:00
Andi Kleen
61cf693159 [sysctl] Fix breakage on systems with older glibc
As predicted during code review, the sysctl(2) changes made systems with
old glibc nearly unusable.  About every command gives a:

  warning: process `ls' used the deprecated sysctl system call with 1.4

warning in the log.

I see this on a SUSE 10.0 system with glibc 2.3.5.

Don't warn for this common case.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 12:36:18 -08:00
Linus Torvalds
8aedf8a6ae Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (52 commits)
  perf record: Use per-task-per-cpu events for inherited events
  perf record: Properly synchronize child creation
  perf events: Allow per-task-per-cpu counters
  perf diff: Percent calcs should use double values
  perf diff: Change the default sort order to "dso,symbol"
  perf diff: Use perf_session__fprintf_hists just like 'perf record'
  perf report: Fix cut'n'paste error recently introduced
  perf session: Move perf report specific hits out of perf_session__fprintf_hists
  perf tools: Move hist entries printing routines from perf report
  perf report: Generalize perf_session__fprintf_hists()
  perf symbols: Move symbol filtering to event__preprocess_sample()
  perf symbols: Adopt the strlists for dso, comm
  perf symbols: Make symbol_conf global
  perf probe: Fix to show which probe point is not found
  perf probe: Check symbols in symtab/kallsyms
  perf probe: Check build-id of vmlinux
  perf probe: Reject second attempt of adding same-name event
  perf probe: Support event name for --add option
  perf probe: Add glob matching support on --del
  perf probe: Use strlist__for_each macros in probe-event.c
  ...
2009-12-16 12:32:47 -08:00
Linus Torvalds
da184a8064 Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing: Fix return of trace_dump_stack()
  ksym_tracer: Fix bad cast
  tracing/power: Remove two exports
  tracing: Change event->profile_count to be int type
  tracing: Simplify trace_option_write()
  tracing: Remove useless trace option
  tracing: Use seq file for trace_clock
  tracing: Use seq file for trace_options
  function-graph: Allow writing the same val to set_graph_function
  ftrace: Call trace_parser_clear() properly
  ftrace: Return EINVAL when writing invalid val to set_ftrace_filter
  tracing: Move a printk out of ftrace_raw_reg_event_foo()
  tracing: Pull up calls to trace_define_common_fields()
  tracing: Extract duplicate ftrace_raw_init_event_foo()
  ftrace.h: Use common pr_info fmt string
  tracing: Add stack trace to irqsoff tracer
  tracing: Add trace_dump_stack()
  ring-buffer: Move resize integrity check under reader lock
  ring-buffer: Use sync sched protection on ring buffer resizing
  tracing: Fix wrong usage of strstrip in trace_ksyms
2009-12-16 12:02:25 -08:00
Linus Torvalds
74f3ae7434 Merge branch 'module' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
* 'module' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  modpost: fix segfault with short symbol names
  module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y
  Kbuild: clear marker out of modpost
  module: make MODULE_SYMBOL_PREFIX into a CONFIG option
  ARM: unexport symbols used to implement floating point emulation
  ARM: use unified discard definition in linker script
  x86: don't export inline function
  sparc64: don't export static inline pci_ functions
2009-12-16 10:47:24 -08:00
Linus Torvalds
60d9aa758c Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (90 commits)
  jffs2: Fix long-standing bug with symlink garbage collection.
  mtd: OneNAND: Fix test of unsigned in onenand_otp_walk()
  mtd: cfi_cmdset_0002, fix lock imbalance
  Revert "mtd: move mxcnd_remove to .exit.text"
  mtd: m25p80: add support for Macronix MX25L4005A
  kmsg_dump: fix build for CONFIG_PRINTK=n
  mtd: nandsim: add support for 4KiB pages
  mtd: mtdoops: refactor as a kmsg_dumper
  mtd: mtdoops: make record size configurable
  mtd: mtdoops: limit the maximum mtd partition size
  mtd: mtdoops: keep track of used/unused pages in an array
  mtd: mtdoops: several minor cleanups
  core: Add kernel message dumper to call on oopses and panics
  mtd: add ARM pismo support
  mtd: pxa3xx_nand: Fix PIO data transfer
  mtd: nand: fix multi-chip suspend problem
  mtd: add support for switching old SST chips into QRY mode
  mtd: fix M29W800D dev_id and uaddr
  mtd: don't use PF_MEMALLOC
  mtd: Add bad block table overrides to Davinci NAND driver
  ...

Fixed up conflicts (mostly trivial) in
	drivers/mtd/devices/m25p80.c
	drivers/mtd/maps/pcmciamtd.c
	drivers/mtd/nand/pxa3xx_nand.c
	kernel/printk.c
2009-12-16 10:23:43 -08:00
Peter Zijlstra
738d2be430 sched: Simplify set_task_cpu()
Rearrange code a bit now that its a simpler function.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170518.269101883@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 19:01:59 +01:00
Peter Zijlstra
88ec22d3ed sched: Remove the cfs_rq dependency from set_task_cpu()
In order to remove the cfs_rq dependency from set_task_cpu() we
need to ensure the task is cfs_rq invariant for all callsites.

The simple approach is to substract cfs_rq->min_vruntime from
se->vruntime on dequeue, and add cfs_rq->min_vruntime on
enqueue.

However, this has the downside of breaking FAIR_SLEEPERS since
we loose the old vruntime as we only maintain the relative
position.

To solve this, we observe that we only migrate runnable tasks,
we do this using deactivate_task(.sleep=0) and
activate_task(.wakeup=0), therefore we can restrain the
min_vruntime invariance to that state.

The only other case is wakeup balancing, since we want to
maintain the old vruntime we cannot make it relative on dequeue,
but since we don't migrate inactive tasks, we can do so right
before we activate it again.

This is where we need the new pre-wakeup hook, we need to call
this while still holding the old rq->lock. We could fold it into
->select_task_rq(), but since that has multiple callsites and
would obfuscate the locking requirements, that seems like a
fudge.

This leaves the fork() case, simply make sure that ->task_fork()
leaves the ->vruntime in a relative state.

This covers all cases where set_task_cpu() gets called, and
ensures it sees a relative vruntime.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170518.191697025@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 19:01:58 +01:00
Peter Zijlstra
efbbd05a59 sched: Add pre and post wakeup hooks
As will be apparent in the next patch, we need a pre wakeup hook
for sched_fair task migration, hence rename the post wakeup hook
and one pre wakeup.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170518.114746117@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 19:01:58 +01:00
Peter Zijlstra
881232b70b sched: Move kthread_bind() back to kthread.c
Since kthread_bind() lost its dependencies on sched.c, move it
back where it came from.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170518.039524041@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 19:01:57 +01:00
Peter Zijlstra
5da9a0fb67 sched: Fix select_task_rq() vs hotplug issues
Since select_task_rq() is now responsible for guaranteeing
->cpus_allowed and cpu_active_mask, we need to verify this.

select_task_rq_rt() can blindly return
smp_processor_id()/task_cpu() without checking the valid masks,
select_task_rq_fair() can do the same in the rare case that all
SD_flags are disabled.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170517.961475466@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 19:01:57 +01:00
Peter Zijlstra
3802290628 sched: Fix sched_exec() balancing
Since we access ->cpus_allowed without holding rq->lock we need
a retry loop to validate the result, this comes for near free
when we merge sched_migrate_task() into sched_exec() since that
already does the needed check.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170517.884743662@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 19:01:56 +01:00
Peter Zijlstra
e2912009fb sched: Ensure set_task_cpu() is never called on blocked tasks
In order to clean up the set_task_cpu() rq dependencies we need
to ensure it is never called on blocked tasks because such usage
does not pair with consistent rq->lock usage.

This puts the migration burden on ttwu().

Furthermore we need to close a race against changing
->cpus_allowed, since select_task_rq() runs with only preemption
disabled.

For sched_fork() this is safe because the child isn't in the
tasklist yet, for wakeup we fix this by synchronizing
set_cpus_allowed_ptr() against TASK_WAKING, which leaves
sched_exec to be a problem

This also closes a hole in (6ad4c1888 sched: Fix balance vs
hotplug race) where ->select_task_rq() doesn't validate the
result against the sched_domain/root_domain.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170517.807938893@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 19:01:56 +01:00
Peter Zijlstra
06b83b5fbe sched: Use TASK_WAKING for fork wakups
For later convenience use TASK_WAKING for fresh tasks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170517.732561278@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 19:01:55 +01:00
Peter Zijlstra
e4f4288842 sched: Select_task_rq_fair() must honour SD_LOAD_BALANCE
We should skip !SD_LOAD_BALANCE domains.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170517.653578430@chello.nl>
CC: stable@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 19:01:55 +01:00
Peter Zijlstra
e6c8fba777 sched: Fix task_hot() test order
Make sure not to access sched_fair fields before verifying it is
indeed a sched_fair task.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
CC: stable@kernel.org
LKML-Reference: <20091216170517.577998058@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 19:01:54 +01:00
Xiaotian Feng
9ee349ad6d sched: Fix set_cpu_active() in cpu_down()
Sachin found cpu hotplug test failures on powerpc, which made
the kernel hang on his POWER box.

The problem is that we fail to re-activate a cpu when a
hot-unplug fails. Fix this by moving the de-activation into
_cpu_down after doing the initial checks.

Remove the synchronize_sched() calls and rely on those implied
by rebuilding the sched domains using the new mask.

Reported-by: Sachin Sant <sachinp@in.ibm.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Tested-by: Sachin Sant <sachinp@in.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170517.500272612@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 19:01:53 +01:00
Ingo Molnar
ee1156c11a Merge branch 'linus' into sched/urgent
Conflicts:
	kernel/sched_idletask.c

Merge reason: resolve the conflicts, pick up latest changes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 18:33:49 +01:00
Peter Zijlstra
f4c4176f21 perf events: Allow per-task-per-cpu counters
In order to allow for per-task-per-cpu counters, useful for
scalability when profiling task hierarchies, we allow installing
events with event->cpu != -1 in task contexts.

__perf_event_sched_in() already skips events where ->cpu
mis-matches the current cpu, fix up __perf_install_in_context()
and __perf_event_enable() to also respect this filter.

This does lead to vary hard to interpret enabled/running times
for such counters, but I don't see a simple solution for that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: fweisbec@gmail.com
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20091216165904.831451147@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 18:30:11 +01:00
Amerigo Wang
06a7f71124 kexec: premit reduction of the reserved memory size
Implement shrinking the reserved memory for crash kernel, if it is more
than enough.

For example, if you have already reserved 128M, now you just want 100M,
you can do:

# echo $((100*1024*1024)) > /sys/kernel/kexec_crash_size

Note, you can only do this before loading the crash kernel.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:13 -08:00
André Goddard Rosa
417e315247 pid: reduce code size by using a pointer to iterate over array
It decreases code size by 16 bytes on my gcc 4.4.1 on Core 2:
  text    data     bss     dec     hex filename
  4314    2216       8    6538    198a kernel/pid.o-BEFORE
  4298    2216       8    6522    197a kernel/pid.o-AFTER

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:12 -08:00
André Goddard Rosa
7be6d991bc pid: tighten pidmap spinlock critical section by removing kfree()
Avoid calling kfree() under pidmap spinlock, calling it afterwards.

Normally kfree() is fast, but sometimes it can be slow, so avoid
calling it under the spinlock if we can do it.

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:12 -08:00
Oleg Nesterov
1be53963b0 signals: check ->group_stop_count after tracehook_get_signal()
Move the call to do_signal_stop() down, after tracehook call.  This makes
->group_stop_count condition visible to tracers before do_signal_stop()
will participate in this group-stop.

Currently the patch has no effect, tracehook_get_signal() always returns 0.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:09 -08:00
Oleg Nesterov
ad09750b51 signals: kill force_sig_specific()
Kill force_sig_specific(), this trivial wrapper has no callers.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:09 -08:00
Oleg Nesterov
7486e5d9fc signals: cosmetic, collect_signal: use SI_USER
Trivial, s/0/SI_USER/ in collect_signal() for grep.

This is a bit confusing, we don't know the source of this signal.
But we don't care, and "info->si_code = 0" is imho worse.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:09 -08:00
Oleg Nesterov
dd34200adc signals: send_signal: use si_fromuser() to detect from_ancestor_ns
Change send_signal() to use si_fromuser().  From now SEND_SIG_NOINFO
triggers the "from_ancestor_ns" check.

This fixes reparent_thread()->group_send_sig_info(pdeath_signal)
behaviour, before this patch send_signal() does not detect the
cross-namespace case when the child of the dying parent belongs to the
sub-namespace.

This patch can affect the behaviour of send_sig(), kill_pgrp() and
kill_pid() when the caller sends the signal to the sub-namespace with
"priv == 0" but surprisingly all callers seem to use them correctly,
including disassociate_ctty(on_exit).

Except: drivers/staging/comedi/drivers/addi-data/*.c incorrectly use
send_sig(priv => 0).  But his is minor and should be fixed anyway.

Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:09 -08:00
Oleg Nesterov
614c517d7c signals: SEND_SIG_NOINFO should be considered as SI_FROMUSER()
No changes in compiled code. The patch adds the new helper, si_fromuser()
and changes check_kill_permission() to use this helper.

The real effect of this patch is that from now we "officially" consider
SEND_SIG_NOINFO signal as "from user-space" signals. This is already true
if we look at the code which uses SEND_SIG_NOINFO, except __send_signal()
has another opinion - see the next patch.

The naming of these special SEND_SIG_XXX siginfo's is really bad
imho.  From __send_signal()'s pov they mean

	SEND_SIG_NOINFO		from user
	SEND_SIG_PRIV		from kernel
	SEND_SIG_FORCED		no info

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:08 -08:00
Oleg Nesterov
6580807da1 ptrace: copy_process() should disable stepping
If the tracee calls fork() after PTRACE_SINGLESTEP, the forked child
starts with TIF_SINGLESTEP/X86_EFLAGS_TF bits copied from ptraced parent.
This is not right, especially when the new child is not auto-attaced: in
this case it is killed by SIGTRAP.

Change copy_process() to call user_disable_single_step(). Tested on x86.

Test-case:

	#include <stdio.h>
	#include <unistd.h>
	#include <signal.h>
	#include <sys/ptrace.h>
	#include <sys/wait.h>
	#include <assert.h>

	int main(void)
	{
		int pid, status;

		if (!(pid = fork())) {
			assert(ptrace(PTRACE_TRACEME) == 0);
			kill(getpid(), SIGSTOP);

			if (!fork()) {
				/* kernel bug: this child will be killed by SIGTRAP */
				printf("Hello world\n");
				return 43;
			}

			wait(&status);
			return WEXITSTATUS(status);
		}

		for (;;) {
			assert(pid == wait(&status));
			if (WIFEXITED(status))
				break;
			assert(ptrace(PTRACE_SINGLESTEP, pid, 0,0) == 0);
		}

		assert(WEXITSTATUS(status) == 43);
		return 0;
	}

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:08 -08:00
KAMEZAWA Hiroyuki
569b846df5 memcg: coalesce uncharge during unmap/truncate
In massive parallel enviroment, res_counter can be a performance
bottleneck.  One strong techinque to reduce lock contention is reducing
calls by coalescing some amount of calls into one.

Considering charge/uncharge chatacteristic,
	- charge is done one by one via demand-paging.
	- uncharge is done by
		- in chunk at munmap, truncate, exit, execve...
		- one by one via vmscan/paging.

It seems we have a chance to coalesce uncharges for improving scalability
at unmap/truncation.

This patch is a for coalescing uncharge.  For avoiding scattering memcg's
structure to functions under /mm, this patch adds memcg batch uncharge
information to the task.  A reason for per-task batching is for making use
of caller's context information.  We do batched uncharge (deleyed
uncharge) when truncation/unmap occurs but do direct uncharge when
uncharge is called by memory reclaim (vmscan.c).

The degree of coalescing depends on callers
  - at invalidate/trucate... pagevec size
  - at unmap ....ZAP_BLOCK_SIZE
(memory itself will be freed in this degree.)
Then, we'll not coalescing too much.

On x86-64 8cpu server, I tested overheads of memcg at page fault by
running a program which does map/fault/unmap in a loop. Running
a task per a cpu by taskset and see sum of the number of page faults
in 60secs.

[without memcg config]
  40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
  27.67 cache-miss/faults
[root cgroup]
  36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
  31.58 miss/faults
[in a child cgroup]
  18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
  69.96 miss/faults
[child with this patch]
  27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
  47.16 miss/faults

We can see some amounts of improvement.
(root cgroup doesn't affected by this patch)
Another patch for "charge" will follow this and above will be improved more.

Changelog(since 2009/10/02):
 - renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes)
 - some clean up and commentary/description updates.
 - added initialize code to copy_process(). (possible bug fix)

Changelog(old):
 - fixed !CONFIG_MEM_CGROUP case.
 - rebased onto the latest mmotm + softlimit fix patches.
 - unified patch for callers
 - added commetns.
 - make ->do_batch as bool.
 - removed css_get() at el. We don't need it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:07 -08:00
Alexey Dobriyan
28dfef8feb const: constify remaining pipe_buf_operations
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:05 -08:00
Barry Song
f065f41f48 timecompare: fix half-Y2K38 problem in timecompare_update while calculating offset
ktime will overflow from 03:14:07 UTC on Tuesday, 19 January 2038,
ktime_add() in timecompare_update() will overflow a half earlier.  As a
result, wrong offset will be gotten, then cause some strange problems.

Signed-off-by: Barry Song <21cnbao@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Patrick Ohly <patrick.ohly@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:19:57 -08:00
Peter Zijlstra
f13c12c634 perf_events: Fix perf_event_attr layout
The miss-alignment of bp_addr created a 32bit hole, causing
different structure packings on 32 and 64 bit machines.

Fix that by moving __reserve_2 into that hole.

Further, remove the useless struct and redundant __bp_reserve
muck.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <1260902591.8023.781.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-15 20:12:20 +01:00
Linus Torvalds
8f0ddf91f2 Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits)
  clockevents: Convert to raw_spinlock
  clockevents: Make tick_device_lock static
  debugobjects: Convert to raw_spinlocks
  perf_event: Convert to raw_spinlock
  hrtimers: Convert to raw_spinlocks
  genirq: Convert irq_desc.lock to raw_spinlock
  smp: Convert smplocks to raw_spinlocks
  rtmutes: Convert rtmutex.lock to raw_spinlock
  sched: Convert pi_lock to raw_spinlock
  sched: Convert cpupri lock to raw_spinlock
  sched: Convert rt_runtime_lock to raw_spinlock
  sched: Convert rq->lock to raw_spinlock
  plist: Make plist debugging raw_spinlock aware
  bkl: Fixup core_lock fallout
  locking: Cleanup the name space completely
  locking: Further name space cleanups
  alpha: Fix fallout from locking changes
  locking: Implement new raw_spinlock
  locking: Convert raw_rwlock functions to arch_rwlock
  locking: Convert raw_rwlock to arch_rwlock
  ...
2009-12-15 09:02:01 -08:00
André Goddard Rosa
e7d2860b69 tree-wide: convert open calls to remove spaces to skip_spaces() lib function
Makes use of skip_spaces() defined in lib/string.c for removing leading
spaces from strings all over the tree.

It decreases lib.a code size by 47 bytes and reuses the function tree-wide:
   text    data     bss     dec     hex filename
  64688     584     592   65864   10148 (TOTALS-BEFORE)
  64641     584     592   65817   10119 (TOTALS-AFTER)

Also, while at it, if we see (*str && isspace(*str)), we can be sure to
remove the first condition (*str) as the second one (isspace(*str)) also
evaluates to 0 whenever *str == 0, making it redundant. In other words,
"a char equals zero is never a space".

Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below,
and found occurrences of this pattern on 3 more files:
    drivers/leds/led-class.c
    drivers/leds/ledtrig-timer.c
    drivers/video/output.c

@@
expression str;
@@

( // ignore skip_spaces cases
while (*str &&  isspace(*str)) { \(str++;\|++str;\) }
|
- *str &&
isspace(*str)
)

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Julia Lawall <julia@diku.dk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:32 -08:00
Bernhard Walle
5ada918b82 vt: introduce and use vt_kmsg_redirect() function
The kernel offers with TIOCL_GETKMSGREDIRECT ioctl() the possibility to
redirect the kernel messages to a specific console.

However, since it's not possible to switch to the kernel message console
after a panic(), it would be nice if the kernel would print the panic
message on the current console.

This patch series adds a new interface to access the global kmsg_redirect
variable by a function to be able to use it in code where
CONFIG_VT_CONSOLE is not set (kernel/panic.c).

This patch:

Instead of using and exporting a global value kmsg_redirect, introduce a
function vt_kmsg_redirect() that both can set and return the console where
messages are printed.

Change all users of kmsg_redirect (the VT code itself and kernel/power.c)
to the new interface.

The main advantage is that vt_kmsg_redirect() can also be used when
CONFIG_VT_CONSOLE is not set.

Signed-off-by: Bernhard Walle <bernhard@bwalle.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:28 -08:00
H Hartley Sweeten
dfc6a736d4 kernel/sys.c: fix "warning: do-while statement is not a compound statement" noise
do_each_thread/while_each_thread wrap a block of code that is in this format:

	for (...)
		do
			...
		while

If curly braces do not surround the inner loop the following warning is
generated by sparse:

	warning: do-while statement is not a compound statement

Fix the warning by adding the braces.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:26 -08:00
Xiao Guangrong
c0f68c2fab generic-ipi: cleanup for generic_smp_call_function_interrupt()
Use smp_processor_id() instead of get_cpu() and put_cpu() in
generic_smp_call_function_interrupt(), It's no need to disable preempt,
because we must call generic_smp_call_function_interrupt() with interrupts
disabled.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:25 -08:00
Amerigo Wang
70da2340fb 'sysctl_max_map_count' should be non-negative
Jan Engelhardt reported we have this problem:

setting max_map_count to a value large enough results in programs dying at
first try.  This is on 2.6.31.6:

15:59 borg:/proc/sys/vm # echo $[1<<31-1] >max_map_count
15:59 borg:/proc/sys/vm # cat max_map_count
1073741824
15:59 borg:/proc/sys/vm # echo $[1<<31] >max_map_count
15:59 borg:/proc/sys/vm # cat max_map_count
Killed

This is because we have a chance to make 'max_map_count' negative.  but
it's meaningless.  Make it only accept non-negative values.

Reported-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: James Morris <jmorris@namei.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:23 -08:00
Lee Schermerhorn
06808b0827 hugetlb: derive huge pages nodes allowed from task mempolicy
This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy".  The nodes_allowed mask is derived as follows:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
  is produced.  This will cause the hugetlb subsystem to use
  node_online_map as the "nodes_allowed".  This preserves the
  behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
  a nodemask with the single preferred node will be produced.
  "local" policy will NOT track any internode migrations of the
  task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
  will be used.
* Other than to inform the construction of the nodes_allowed node
  mask, the actual mempolicy mode is ignored.  That is, all modes
  behave like interleave over the resulting nodes_allowed mask
  with no "fallback".

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

	Node 0 HugePages_Total:     0
	Node 1 HugePages_Total:     0
	Node 2 HugePages_Total:     0
	Node 3 HugePages_Total:     0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

	sysctl vm.nr_hugepages[_mempolicy]=32

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:     8
	Node 3 HugePages_Total:     8

Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes.  So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:

	numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

This yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

	numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

yields:

	Node 0 HugePages_Total:     4
	Node 1 HugePages_Total:     4
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The 8 huge pages freed were balanced over nodes 0 and 1.

[rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:12 -08:00
Alexey Dobriyan
4b731d50ff bsdacct: fix uid/gid misreporting
commit d8e180dcd5 "bsdacct: switch
credentials for writing to the accounting file" introduced credential
switching during final acct data collecting.  However, uid/gid pair
continued to be collected from current which became credentials of who
created acct file, not who exits.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14676

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: Juho K. Juopperi <jkj@kapsi.fi>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:10 -08:00
Paul Mackerras
0f624e7e56 perf_event: Fix incorrect range check on cpu number
It is quite legitimate for CPUs to be numbered sparsely, meaning
that it possible for an online CPU to have a number which is
greater than the total count of possible CPUs.

Currently find_get_context() has a sanity check on the cpu
number where it checks it against num_possible_cpus().  This
test can fail for a legitimate cpu number if the
cpu_possible_mask is sparsely populated.

This fixes the problem by checking the CPU number against
nr_cpumask_bits instead, since that is the appropriate check to
ensure that the cpu number is same to pass to cpu_isset()
subsequently.

Reported-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Michael Neuling <mikey@neuling.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@kernel.org>
LKML-Reference: <20091215084032.GA18661@brick.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-15 13:09:55 +01:00
David Miller
b9f8fcd55b sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
Relax stable-sched-clock architectures to not save/disable/restore
hardirqs in cpu_clock().

The background is that I was trying to resolve a sparc64 perf
issue when I discovered this problem.

On sparc64 I implement pseudo NMIs by simply running the kernel
at IRQ level 14 when local_irq_disable() is called, this allows
performance counter events to still come in at IRQ level 15.

This doesn't work if any code in an NMI handler does
local_irq_save() or local_irq_disable() since the "disable" will
kick us back to cpu IRQ level 14 thus letting NMIs back in and
we recurse.

The only path which that does that in the perf event IRQ
handling path is the code supporting frequency based events.  It
uses cpu_clock().

cpu_clock() simply invokes sched_clock() with IRQs disabled.

And that's a fundamental bug all on it's own, particularly for
the HAVE_UNSTABLE_SCHED_CLOCK case.  NMIs can thus get into the
sched_clock() code interrupting the local IRQ disable code
sections of it.

Furthermore, for the not-HAVE_UNSTABLE_SCHED_CLOCK case, the IRQ
disabling done by cpu_clock() is just pure overhead and
completely unnecessary.

So the core problem is that sched_clock() is not NMI safe, but
we are invoking it from NMI contexts in the perf events code
(via cpu_clock()).

A less important issue is the overhead of IRQ disabling when it
isn't necessary in cpu_clock().

CONFIG_HAVE_UNSTABLE_SCHED_CLOCK architectures are not
affected by this patch.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091213.182502.215092085.davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-15 09:04:36 +01:00
Steven Rostedt
e36c54582c tracing: Fix return of trace_dump_stack()
The trace_dump_stack() returned a value for a void function.

Also, added the missing stub for trace_dump_stack() when tracing is
not configured.

Reported-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <20091214162713.GA31060@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-15 08:36:11 +01:00
Rusty Russell
d4703aefdb module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y
powerpc applies relocations to the kcrctab.  They're absolute symbols,
but it's not completely unreasonable: other archs may too, but the
relocation is often 0.

http://lists.ozlabs.org/pipermail/linuxppc-dev/2009-November/077972.html

Inspired-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Tested-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Paul Mackerras <paulus@samba.org>
2009-12-15 16:28:34 +10:30
Thomas Gleixner
b5f91da0a6 clockevents: Convert to raw_spinlock
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:34 +01:00
Thomas Gleixner
d192c47f25 clockevents: Make tick_device_lock static
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:34 +01:00
Thomas Gleixner
e625cce1b7 perf_event: Convert to raw_spinlock
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:34 +01:00
Thomas Gleixner
ecb49d1a63 hrtimers: Convert to raw_spinlocks
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:34 +01:00
Thomas Gleixner
239007b844 genirq: Convert irq_desc.lock to raw_spinlock
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:33 +01:00
Thomas Gleixner
9f5a5621e7 smp: Convert smplocks to raw_spinlocks
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:33 +01:00
Thomas Gleixner
d209d74d52 rtmutes: Convert rtmutex.lock to raw_spinlock
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:33 +01:00
Thomas Gleixner
1d61548254 sched: Convert pi_lock to raw_spinlock
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:33 +01:00
Thomas Gleixner
fe841226bd sched: Convert cpupri lock to raw_spinlock
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:33 +01:00
Thomas Gleixner
0986b11b12 sched: Convert rt_runtime_lock to raw_spinlock
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:33 +01:00
Thomas Gleixner
05fa785cf8 sched: Convert rq->lock to raw_spinlock
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:33 +01:00
Thomas Gleixner
a26724591e plist: Make plist debugging raw_spinlock aware
plists are used with spinlocks and raw_spinlocks. Change the plist
debugging to handle both types.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:33 +01:00
Thomas Gleixner
9c1721aa49 locking: Cleanup the name space completely
Make the name space hierarchy of locking functions consistent:
     raw_spin* -> _raw_spin* -> __raw_spin*

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:33 +01:00
Thomas Gleixner
9828ea9d75 locking: Further name space cleanups
The name space hierarchy for the internal lock functions is now a bit
backwards. raw_spin* functions map to _spin* which use __spin*, while
we would like to have _raw_spin* and __raw_spin*.

_raw_spin* is already used by lock debugging, so rename those funtions
to do_raw_spin* to free up the _raw_spin* name space.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:33 +01:00
Thomas Gleixner
c2f21ce2e3 locking: Implement new raw_spinlock
Now that the raw_spin name space is freed up, we can implement
raw_spinlock and the related functions which are used to annotate the
locks which are not converted to sleeping spinlocks in preempt-rt.

A side effect is that only such locks can be used with the low level
lock fsunctions which circumvent lockdep.

For !rt spin_* functions are mapped to the raw_spin* implementations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:32 +01:00
Thomas Gleixner
0199c4e68d locking: Convert __raw_spin* functions to arch_spin*
Name space cleanup. No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: linux-arch@vger.kernel.org
2009-12-14 23:55:32 +01:00
Thomas Gleixner
edc35bd72e locking: Rename __RAW_SPIN_LOCK_UNLOCKED to __ARCH_SPIN_LOCK_UNLOCKED
Further name space cleanup. No functional change

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: linux-arch@vger.kernel.org
2009-12-14 23:55:32 +01:00
Thomas Gleixner
445c89514b locking: Convert raw_spinlock to arch_spinlock
The raw_spin* namespace was taken by lockdep for the architecture
specific implementations. raw_spin_* would be the ideal name space for
the spinlocks which are not converted to sleeping locks in preempt-rt.

Linus suggested to convert the raw_ to arch_ locks and cleanup the
name space instead of using an artifical name like core_spin,
atomic_spin or whatever

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: linux-arch@vger.kernel.org
2009-12-14 23:55:32 +01:00
Thomas Gleixner
b7b40ade58 locking: Reorder functions in spinlock.c
Separate spin_lock and rw_lock functions. Preempt-RT needs to exclude
the rw_lock functions from being compiled. The reordering allows to do
that with a single #ifdef.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 23:55:32 +01:00
Linus Torvalds
d0316554d3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits)
  m68k: rename global variable vmalloc_end to m68k_vmalloc_end
  percpu: add missing per_cpu_ptr_to_phys() definition for UP
  percpu: Fix kdump failure if booted with percpu_alloc=page
  percpu: make misc percpu symbols unique
  percpu: make percpu symbols in ia64 unique
  percpu: make percpu symbols in powerpc unique
  percpu: make percpu symbols in x86 unique
  percpu: make percpu symbols in xen unique
  percpu: make percpu symbols in cpufreq unique
  percpu: make percpu symbols in oprofile unique
  percpu: make percpu symbols in tracer unique
  percpu: make percpu symbols under kernel/ and mm/ unique
  percpu: remove some sparse warnings
  percpu: make alloc_percpu() handle array types
  vmalloc: fix use of non-existent percpu variable in put_cpu_var()
  this_cpu: Use this_cpu_xx in trace_functions_graph.c
  this_cpu: Use this_cpu_xx for ftrace
  this_cpu: Use this_cpu_xx in nmi handling
  this_cpu: Use this_cpu operations in RCU
  this_cpu: Use this_cpu ops for VM statistics
  ...

Fix up trivial (famous last words) global per-cpu naming conflicts in
	arch/x86/kvm/svm.c
	mm/slab.c
2009-12-14 09:58:24 -08:00
Ingo Molnar
0087aabd6a Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent 2009-12-14 17:12:37 +01:00
Thomas Gleixner
1a551ae715 sched: Use rcu in sched_get_rr_param()
read_lock(&tasklist_lock) does not protect
sys_sched_get_rr_param() against a concurrent update of the
policy or scheduler parameters as do_sched_scheduler() does not
take the tasklist_lock.

The access to task->sched_class->get_rr_interval is protected by
task_rq_lock(task).

Use rcu_read_lock() to protect find_task_by_vpid() and prevent
the task struct from going away.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091209100706.862897167@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 17:11:35 +01:00
Thomas Gleixner
23f5d14251 sched: Use rcu in sched_get/set_affinity()
tasklist_lock is held read locked to protect the
find_task_by_vpid() call and to prevent the task going away.
sched_setaffinity acquires a task struct ref and drops tasklist
lock right away. The access to the cpus_allowed mask is
protected by rq->lock.

rcu_read_lock() provides the same protection here.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091209100706.789059966@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 17:11:35 +01:00
Thomas Gleixner
5fe85be081 sched: Use rcu in sys_sched_getscheduler/sys_sched_getparam()
read_lock(&tasklist_lock) does not protect
sys_sched_getscheduler and sys_sched_getparam() against a
concurrent update of the policy or scheduler parameters as
do_sched_setscheduler() does not take the tasklist_lock. The
accessed integers can be retrieved w/o locking and are snapshots
anyway.

Using rcu_read_lock() to protect find_task_by_vpid() and prevent
the task struct from going away is not changing the above
situation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091209100706.753790977@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 17:11:34 +01:00
Ingo Molnar
cc0104e877 Merge branch 'linus' into tracing/urgent
Conflicts:
	kernel/trace/trace_kprobe.c

Merge reason: resolve the conflict.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-14 09:16:49 +01:00
Li Zefan
16620e0f19 ksym_tracer: Fix bad cast
Fix this warning:

kernel/trace/trace_ksym.c: In function 'ksym_trace_filter_read':
kernel/trace/trace_ksym.c:239: warning: cast to pointer from integer of different size

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: "K.Prasad" <prasad@linux.vnet.ibm.com>
LKML-Reference: <4B1DC578.9020909@cn.fujitsu.com>
[remove the strstrip fix as tglx already fixed that]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:46:54 +01:00
Li Zefan
472bbe02c9 tracing/power: Remove two exports
trace_power_start and trace_power_end are used in
arch/x86/kernel/power.c, and this file can't be compiled
as a module, so these two tracepoints don't need to be
exported.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B1DC55F.7060305@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:37:28 +01:00
Li Zefan
e00bf2ec60 tracing: Change event->profile_count to be int type
Like total_profile_count, struct ftrace_event_call::profile_count
is protected by event_mutex, so it doesn't need to be atomic_t.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <4B1DC549.5010705@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:37:28 +01:00
Li Zefan
8d18eaaff5 tracing: Simplify trace_option_write()
- remove duplicate code inside trace_options_write()
- extract duplicate code in trace_options_write() and set_tracer_option()

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B1DC532.9010802@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:37:28 +01:00
Li Zefan
2cbafd68b8 tracing: Remove useless trace option
Since commit 4d9493c90f
("ftrace: remove add-hoc code"), option "sched-tree"
has become useless.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B1DC50A.7040402@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:37:27 +01:00
Li Zefan
13f16d2091 tracing: Use seq file for trace_clock
The buffer for the output is as small as 64 bytes, so it'll
overflow if we add more clock type. Use seq file instead.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B1DC4FB.5030407@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:37:27 +01:00
Li Zefan
fdb372ed4c tracing: Use seq file for trace_options
Code simplification for reading trace_options.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-reference: <4B1DC4EF.3090106@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:37:27 +01:00
Li Zefan
91baf6285b function-graph: Allow writing the same val to set_graph_function
# echo 'do_open' > set_graph_function
 # echo 'do_open' >> set_graph_function
 bash: echo: write error: Invalid argument

Make it valid to write the same value to set_graph_function,
which is consistent with set_ftrace_filter interface.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-reference: <4B1DC4E1.1060303@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:37:26 +01:00
Li Zefan
313254a940 ftrace: Call trace_parser_clear() properly
I found a weird behavior:

  # echo 'fuse:*' > set_ftrace_filter
  bash: echo: write error: Invalid argument
  # cat set_ftrace_filter
  fuse_dev_fasync
  fuse_dev_poll
  fuse_copy_do

We should call trace_parser_clear() no matter ftrace_process_regex()
returns 0 or -errno, otherwise we will actually take the unaccepted
records from ftrace_regex_release().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B1DC4D2.3000406@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:37:26 +01:00
Li Zefan
311d16da57 ftrace: Return EINVAL when writing invalid val to set_ftrace_filter
Currently it doesn't warn user on invald value:

 # echo nonexist_symbol > set_ftrace_filter
or:
 # echo 'nonexist_symbol:mod:fuse' > set_ftrace_filter

Better make it return failure.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B1DC4BF.2070003@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:37:25 +01:00
Li Zefan
3b8e427381 tracing: Move a printk out of ftrace_raw_reg_event_foo()
Move the printk from each ftrace_raw_reg_event_foo() to
its caller ftrace_event_enable_disable(). This avoids each
regfunc trace event callbacks to handle a same error report
that can be carried from the caller.

See how much space this saves:

   text    data     bss     dec     hex filename
5345151 1961864 7103260 14410275         dbe223 vmlinux.o.old
5331487 1961864 7103260 14396611         dbacc3 vmlinux.o

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@redhat.com>
LKML-Reference: <4B1DC4AC.802@cn.fujitsu.com>
[start cmdline record before calling regfunc to avoid lost
window of pid to comm resolution]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:37:25 +01:00
Li Zefan
614a71a26b tracing: Pull up calls to trace_define_common_fields()
Call trace_define_common_fields() in event_create_dir() only.
This avoids trace events to handle it from their define_fields
callbacks and shrinks the kernel code size:

   text    data     bss     dec     hex filename
5346802 1961864 7103260 14411926         dbe896 vmlinux.o.old
5345151 1961864 7103260 14410275         dbe223 vmlinux.o

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
LKML-Reference: <4B1DC49C.8000107@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:34:23 +01:00
Li Zefan
87d9b4e1c5 tracing: Extract duplicate ftrace_raw_init_event_foo()
Use a generic trace_event_raw_init() function for all event's raw_init
callbacks (but kprobes) instead of defining the same version for each
of these.
This shrinks the kernel code:

   text    data     bss     dec     hex filename
5355293 1961928 7103260 14420481         dc0a01 vmlinux.o.old
5346802 1961864 7103260 14411926         dbe896 vmlinux.o

raw_init can't be removed, because ftrace events and kprobe events
use different raw_init callbacks. Though it's possible to totally
remove raw_init, I choose to leave it as it is for now.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <4B1DC48C.7080603@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-12-13 18:34:23 +01:00
Joe Perches
663997d417 sched: Use pr_fmt() and pr_<level>()
- Convert printk(KERN_<level> to pr_<level> (not KERN_DEBUG)
 - Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 - Coalesce long format strings
 - Add missing \n to "ERROR: !SD_LOAD_BALANCE domain has parent"

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1260655047.2637.7.camel@Joe-Laptop.home>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-13 08:13:55 +01:00
Rafael J. Wysocki
7539a3b3d1 sched: Make wakeup side and atomic variants of completion API irq safe
Alan Stern noticed that all the wakeup side (and atomic) variants of the
completion APIs should be irq safe, but the newly introduced
completion_done() and try_wait_for_completion() aren't. The use of the
irq unsafe variants in IRQ contexts can cause crashes/hangs.

Fix the problem by making them use spin_lock_irqsave() and
spin_lock_irqrestore().

Reported-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: pm list <linux-pm@lists.linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Chinner <david@fromorbit.com>
Cc: Lachlan McIlroy <lachlan@sgi.com>
LKML-Reference: <200912130007.30541.rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-13 08:12:46 +01:00
Linus Torvalds
702a7c7609 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (21 commits)
  sched: Remove forced2_migrations stats
  sched: Fix memory leak in two error corner cases
  sched: Fix build warning in get_update_sysctl_factor()
  sched: Update normalized values on user updates via proc
  sched: Make tunable scaling style configurable
  sched: Fix missing sched tunable recalculation on cpu add/remove
  sched: Fix task priority bug
  sched: cgroup: Implement different treatment for idle shares
  sched: Remove unnecessary RCU exclusion
  sched: Discard some old bits
  sched: Clean up check_preempt_wakeup()
  sched: Move update_curr() in check_preempt_wakeup() to avoid redundant call
  sched: Sanitize fork() handling
  sched: Clean up ttwu() rq locking
  sched: Remove rq->clock coupling from set_task_cpu()
  sched: Consolidate select_task_rq() callers
  sched: Remove sysctl.sched_features
  sched: Protect sched_rr_get_param() access to task->sched_class
  sched: Protect task->cpus_allowed access in sched_getaffinity()
  sched: Fix balance vs hotplug race
  ...

Fixed up conflicts in kernel/sysctl.c (due to sysctl cleanup)
2009-12-12 11:34:10 -08:00
Sam Ravnborg
273b281fa2 kbuild: move utsrelease.h to include/generated
Fix up all users of utsrelease.h

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Michal Marek <mmarek@suse.cz>
2009-12-12 13:08:15 +01:00
Sam Ravnborg
01fc0ac198 kbuild: move bounds.h to include/generated
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Michal Marek <mmarek@suse.cz>
2009-12-12 13:08:14 +01:00
Linus Torvalds
3070f27d6e Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  itimer: Fix the itimer trace print format
  hrtimer: move timer stats helper functions to hrtimer.c
  hrtimer: Tune hrtimer_interrupt hang logic
2009-12-11 20:49:09 -08:00
Linus Torvalds
1e57c2186f Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  lockdep: Avoid out of bounds array reference in save_trace()
  futex: Take mmap_sem for get_user_pages in fault_in_user_writeable
  lockstat: Add usage info to Documentation/lockstat.txt
  lockstat: Fix min, max times in /proc/lock_stats
2009-12-11 20:48:21 -08:00
Linus Torvalds
df7147b3c3 Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing: Remove comparing of NULL to va_list in trace_array_vprintk()
  tracing: Fix function graph trace_pipe to properly display failed entries
  tracing: Add full state to trace_seq
  tracing: Buffer the output of seq_file in case of filled buffer
  tracing: Only call pipe_close if pipe_close is defined
  tracing: Add pipe_close interface
2009-12-11 20:47:44 -08:00
Linus Torvalds
6f696eb17b Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (57 commits)
  x86, perf events: Check if we have APIC enabled
  perf_event: Fix variable initialization in other codepaths
  perf kmem: Fix unused argument build warning
  perf symbols: perf_header__read_build_ids() offset'n'size should be u64
  perf symbols: dsos__read_build_ids() should read both user and kernel buildids
  perf tools: Align long options which have no short forms
  perf kmem: Show usage if no option is specified
  sched: Mark sched_clock() as notrace
  perf sched: Add max delay time snapshot
  perf tools: Correct size given to memset
  perf_event: Fix perf_swevent_hrtimer() variable initialization
  perf sched: Fix for getting task's execution time
  tracing/kprobes: Fix field creation's bad error handling
  perf_event: Cleanup for cpu_clock_perf_event_update()
  perf_event: Allocate children's perf_event_ctxp at the right time
  perf_event: Clean up __perf_event_init_context()
  hw-breakpoints: Modify breakpoints without unregistering them
  perf probe: Update perf-probe document
  perf probe: Support --del option
  trace-kprobe: Support delete probe syntax
  ...
2009-12-11 20:47:30 -08:00
Linus Torvalds
0f4974c439 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (58 commits)
  tty: split the lock up a bit further
  tty: Move the leader test in disassociate
  tty: Push the bkl down a bit in the hangup code
  tty: Push the lock down further into the ldisc code
  tty: push the BKL down into the handlers a bit
  tty: moxa: split open lock
  tty: moxa: Kill the use of lock_kernel
  tty: moxa: Fix modem op locking
  tty: moxa: Kill off the throttle method
  tty: moxa: Locking clean up
  tty: moxa: rework the locking a bit
  tty: moxa: Use more tty_port ops
  tty: isicom: fix deadlock on shutdown
  tty: mxser: Use the new locking rules to fix setserial properly
  tty: mxser: use the tty_port_open method
  tty: isicom: sort out the board init logic
  tty: isicom: switch to the new tty_port_open helper
  tty: tty_port: Add a kref object to the tty port
  tty: istallion: tty port open/close methods
  tty: stallion: Convert to the tty_port_open/close methods
  ...
2009-12-11 15:34:40 -08:00
Linus Torvalds
880188b243 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
  kgdb: Always process the whole breakpoint list on activate or deactivate
  kgdb: continue and warn on signal passing from gdb
  kgdb,x86: do not set kgdb_single_step on x86
  kgdb: allow for cpu switch when single stepping
  kgdb,i386: Fix corner case access to ss with NMI watch dog exception
  kgdb: Replace strstr() by strchr() for single-character needles
  kgdbts: Read buffer overflow
  kgdb: Read buffer overflow
  kgdb,x86: remove redundant test
2009-12-11 15:19:56 -08:00
Alan Cox
5ec93d1154 tty: Move the leader test in disassociate
There are two call points, both want to check that tty->signal->leader is
set. Move the test into disassociate_ctty() as that will make locking
changes easier in a bit

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-11 15:18:08 -08:00
Linus Torvalds
11bd04f6f3 Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (109 commits)
  PCI: fix coding style issue in pci_save_state()
  PCI: add pci_request_acs
  PCI: fix BUG_ON triggered by logical PCIe root port removal
  PCI: remove ifdefed pci_cleanup_aer_correct_error_status
  PCI: unconditionally clear AER uncorr status register during cleanup
  x86/PCI: claim SR-IOV BARs in pcibios_allocate_resource
  PCI: portdrv: remove redundant definitions
  PCI: portdrv: remove unnecessary struct pcie_port_data
  PCI: portdrv: minor cleanup for pcie_port_device_register
  PCI: portdrv: add missing irq cleanup
  PCI: portdrv: enable device before irq initialization
  PCI: portdrv: cleanup service irqs initialization
  PCI: portdrv: check capabilities first
  PCI: portdrv: move PME capability check
  PCI: portdrv: remove redundant pcie type calculation
  PCI: portdrv: cleanup pcie_device registration
  PCI: portdrv: remove redundant pcie_port_device_probe
  PCI: Always set prefetchable base/limit upper32 registers
  PCI: read-modify-write the pcie device control register when initiating pcie flr
  PCI: show dma_mask bits in /sys
  ...

Fixed up conflicts in:
	arch/x86/kernel/amd_iommu_init.c
	drivers/pci/dmar.c
	drivers/pci/hotplug/acpiphp_glue.c
2009-12-11 12:18:16 -08:00
Steven Rostedt
cc51a0fca6 tracing: Add stack trace to irqsoff tracer
The irqsoff and friends tracers help in finding causes of latency in the
kernel. The also work with the function tracer to show what was happening
when interrupts or preemption are disabled. But the function tracer has
a bit of an overhead and can cause exagerated readings.

Currently, when tracing with /proc/sys/kernel/ftrace_enabled = 0, where the
function tracer is disabled, the information that is provided can end up
being useless. For example, a 2 and a half millisecond latency only showed:

 # tracer: preemptirqsoff
 #
 # preemptirqsoff latency trace v1.1.5 on 2.6.32
 # --------------------------------------------------------------------
 # latency: 2463 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
 #    -----------------
 #    | task: -4242 (uid:0 nice:0 policy:0 rt_prio:0)
 #    -----------------
 #  => started at: _spin_lock_irqsave
 #  => ended at:   remove_wait_queue
 #
 #
 #                  _------=> CPU#
 #                 / _-----=> irqs-off
 #                | / _----=> need-resched
 #                || / _---=> hardirq/softirq
 #                ||| / _--=> preempt-depth
 #                |||| /_--=> lock-depth
 #                |||||/     delay
 #  cmd     pid   |||||| time  |   caller
 #     \   /      ||||||   \   |   /
 hackbenc-4242    2d....    0us!: trace_hardirqs_off <-_spin_lock_irqsave
 hackbenc-4242    2...1. 2463us+: _spin_unlock_irqrestore <-remove_wait_queue
 hackbenc-4242    2...1. 2466us : trace_preempt_on <-remove_wait_queue

The above lets us know that hackbench with pid 2463 grabbed a spin lock
somewhere and enabled preemption at remove_wait_queue. This helps a little
but where this actually happened is not informative.

This patch adds the stack dump to the end of the irqsoff tracer. This provides
the following output:

 hackbenc-4242    2d....    0us!: trace_hardirqs_off <-_spin_lock_irqsave
 hackbenc-4242    2...1. 2463us+: _spin_unlock_irqrestore <-remove_wait_queue
 hackbenc-4242    2...1. 2466us : trace_preempt_on <-remove_wait_queue
 hackbenc-4242    2...1. 2467us : <stack trace>
  => sub_preempt_count
  => _spin_unlock_irqrestore
  => remove_wait_queue
  => free_poll_entry
  => poll_freewait
  => do_sys_poll
  => sys_poll
  => system_call_fastpath

Now we see that the culprit of this latency was the free_poll_entry code.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-11 13:19:51 -05:00
Steven Rostedt
03889384ce tracing: Add trace_dump_stack()
I've been asked a few times about how to find out what is calling
some location in the kernel. One way is to use dynamic function tracing
and implement the func_stack_trace. But this only finds out who is
calling a particular function. It does not tell you who is calling
that function and entering a specific if conditional.

I have myself implemented a quick version of trace_dump_stack() for
this purpose a few times, and just needed it now. This is when I realized
that this would be a good tool to have in the kernel like trace_printk().

Using trace_dump_stack() is similar to dump_stack() except that it
writes to the trace buffer instead and can be used in critical locations.

For example:

@@ -5485,8 +5485,12 @@ need_resched_nonpreemptible:
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
 		if (unlikely(signal_pending_state(prev->state, prev)))
 			prev->state = TASK_RUNNING;
-		else
+		else {
 			deactivate_task(rq, prev, 1);
+			trace_printk("Deactivating task %s:%d\n",
+				     prev->comm, prev->pid);
+			trace_dump_stack();
+		}
 		switch_count = &prev->nvcsw;
 	}

Produces:

           <...>-3249  [001]   296.105269: schedule: Deactivating task ntpd:3249
           <...>-3249  [001]   296.105270: <stack trace>
 => schedule
 => schedule_hrtimeout_range
 => poll_schedule_timeout
 => do_select
 => core_sys_select
 => sys_select
 => system_call_fastpath

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-11 10:38:47 -05:00
Jason Wessel
7f8b7ed6f8 kgdb: Always process the whole breakpoint list on activate or deactivate
This patch fixes 2 edge cases in using kgdb in conjunction with gdb.

1) kgdb_deactivate_sw_breakpoints() should process the entire array of
   breakpoints.  The failure to do so results in breakpoints that you
   cannot remove, because a break point can only be removed if its
   state flag is set to BP_SET.

   The easy way to duplicate this problem is to plant a break point in
   a kernel module and then unload the kernel module.

2) kgdb_activate_sw_breakpoints() should process the entire array of
   breakpoints.  The failure to do so results in missed breakpoints
   when a breakpoint cannot be activated.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2009-12-11 08:43:20 -06:00
Jason Wessel
d625e9c0d7 kgdb: continue and warn on signal passing from gdb
On some architectures for the segv trap, gdb wants to pass the signal
back on continue.  For kgdb this is not the default behavior, because
it can cause the kernel to crash if you arbitrarily pass back a
exception outside of kgdb.

Instead of causing instability, pass a message back to gdb about the
supported kgdb signal passing and execute a standard kgdb continue
operation.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2009-12-11 08:43:19 -06:00
Jason Wessel
028e7b1759 kgdb: allow for cpu switch when single stepping
The kgdb core should not assume that a single step operation of a
kernel thread will complete on the same CPU.  The single step flag is
set at the "thread" level and it is possible in a multi cpu system
that a kernel thread can get scheduled on another cpu the next time it
is run.

As a further safety net in case a slave cpu is hung, the debug master
cpu will try 100 times before giving up and assuming control of the
slave cpus is no longer possible.  It is more useful to be able to get
some information out of kgdb instead of spinning forever.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2009-12-11 08:43:17 -06:00
Jason Wessel
84667d4849 kgdb: Read buffer overflow
Roel Kluin reported an error found with Parfait.  Where we want to
ensure that that kgdb_info[-1] never gets accessed.

Also check to ensure any negative tid does not exceed the size of the
shadow CPU array, else report critical debug context because it is an
internal kgdb failure.

Reported-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2009-12-11 08:43:13 -06:00
Thomas Gleixner
bb6eddf767 clockevents: Prevent clockevent_devices list corruption on cpu hotplug
Xiaotian Feng triggered a list corruption in the clock events list on
CPU hotplug and debugged the root cause.

If a CPU registers more than one per cpu clock event device, then only
the active clock event device is removed on CPU_DEAD. The unused
devices are kept in the clock events device list.

On CPU up the clock event devices are registered again, which means
that we list_add an already enqueued list_head. That results in list
corruption.

Resolve this by removing all devices which are associated to the dead
CPU on CPU_DEAD.

Reported-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Xiaotian Feng <dfeng@redhat.com>
Cc: stable@kernel.org
2009-12-11 10:28:08 +01:00
Steven Rostedt
dd7f594357 ring-buffer: Move resize integrity check under reader lock
While using an application that does splice on the ftrace ring
buffer at start up, I triggered an integrity check failure.

Looking into this, I discovered that resizing the buffer performs
an integrity check after the buffer is resized. This check unfortunately
is preformed after it releases the reader lock. If a reader is
reading the buffer it may cause the integrity check to trigger a
false failure.

This patch simply moves the integrity checker under the protection
of the ring buffer reader lock.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-10 23:20:52 -05:00
Steven Rostedt
184210154b ring-buffer: Use sync sched protection on ring buffer resizing
There was a comment in the ring buffer code that says the calling
layers should prevent tracing or reading of the ring buffer while
resizing. I have discovered that the tracers do not honor this
arrangement.

This patch moves the disabling and synchronizing the ring buffer to
a higher layer during resizing. This guarantees that no writes
are occurring while the resize takes place.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-10 22:54:27 -05:00
Thomas Gleixner
d954fbf0ff tracing: Fix wrong usage of strstrip in trace_ksyms
strstrip returns a pointer to the first non space character, but the
code in parse_ksym_trace_str() ignores that.

strstrip is now must_check and therefor we get the correct warning:
kernel/trace/trace_ksym.c:294: warning:
ignoring return value of ‘strstrip’, declared with attribute warn_unused_result

We are really not interested in leading whitespace here.

Fix that and cleanup the dozen kfree() exit pathes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
2009-12-11 00:01:36 +01:00
Thomas Gleixner
d4581a239a sys: Fix missing rcu protection for __task_cred() access
commit c69e8d9 (CRED: Use RCU to access another task's creds and to
release a task's own creds) added non rcu_read_lock() protected access
to task creds of the target task in set_prio_one().

The comment above the function says:
 * - the caller must hold the RCU read lock

The calling code in sys_setpriority does read_lock(&tasklist_lock) but
not rcu_read_lock(). This works only when CONFIG_TREE_PREEMPT_RCU=n.
With CONFIG_TREE_PREEMPT_RCU=y the rcu_callbacks can run in the tick
interrupt when they see no read side critical section.

There is another instance of __task_cred() in sys_setpriority() itself
which is equally unprotected.

Wrap the whole code section into a rcu read side critical section to
fix this quick and dirty.

Will be revisited in course of the read_lock(&tasklist_lock) -> rcu
crusade.

Oleg noted further:

This also fixes another bug here. find_task_by_vpid() is not safe
without rcu_read_lock(). I do not mean it is not safe to use the
result, just find_pid_ns() by itself is not safe.

Usually tasklist gives enough protection, but if copy_process() fails
it calls free_pid() lockless and does call_rcu(delayed_put_pid().
This means, without rcu lock find_pid_ns() can't scan the hash table
safely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <20091210004703.029784964@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2009-12-10 23:04:11 +01:00
Thomas Gleixner
7cf7db8df0 signals: Fix more rcu assumptions
1) Remove the misleading comment in __sigqueue_alloc() which claims
   that holding a spinlock is equivalent to rcu_read_lock().

2) Add a rcu_read_lock/unlock around the __task_cred() access
   in __sigqueue_alloc()

This needs to be revisited to remove the remaining users of
read_lock(&tasklist_lock) but that's outside the scope of this patch.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <20091210004703.269843657@linutronix.de>
2009-12-10 23:04:11 +01:00
Thomas Gleixner
14d8c9f3c0 signal: Fix racy access to __task_cred in kill_pid_info_as_uid()
kill_pid_info_as_uid() accesses __task_cred() without being in a RCU
read side critical section. tasklist_lock is not protecting that when
CONFIG_TREE_PREEMPT_RCU=y.

Convert the whole tasklist_lock section to rcu and use
lock_task_sighand to prevent the exit race.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <20091210004703.232302055@linutronix.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
2009-12-10 23:04:11 +01:00
Ingo Molnar
b9889ed1dd sched: Remove forced2_migrations stats
This build warning:

 kernel/sched.c: In function 'set_task_cpu':
 kernel/sched.c:2070: warning: unused variable 'old_rq'

Made me realize that the forced2_migrations stat looks pretty
pointless (and a misnomer) - remove it.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-10 20:32:39 +01:00
Linus Torvalds
d71cb81af3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Add debugobjects support
2009-12-10 09:35:44 -08:00
Xiao Guangrong
5e855db5d8 perf_event: Fix variable initialization in other codepaths
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4B20BAA6.7010609@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-10 17:23:02 +01:00
Phil Carmody
dfc12eb26a sched: Fix memory leak in two error corner cases
If the second in each of these pairs of allocations fails, then the
first one will not be freed in the error route out.

Found by a static code analysis tool.

Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1260448177-28448-1-git-send-email-ext-phil.2.carmody@nokia.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-10 14:28:10 +01:00
Heiko Carstens
5f201907df hrtimer: move timer stats helper functions to hrtimer.c
There is no reason to make timer_stats_hrtimer_set_start_info and
friends visible to the rest of the kernel. So move all of them to
hrtimer.c.  Also make timer_stats_hrtimer_set_start_info a static
inline function so it gets inlined and we avoid another function call.
Based on a patch by Thomas Gleixner.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <20091210095629.GC4144@osiris.boeblingen.de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-12-10 13:08:11 +01:00
Thomas Gleixner
41d2e49493 hrtimer: Tune hrtimer_interrupt hang logic
The hrtimer_interrupt hang logic adjusts min_delta_ns based on the
execution time of the hrtimer callbacks.

This is error-prone for virtual machines, where a guest vcpu can be
scheduled out during the execution of the callbacks (and the callbacks
themselves can do operations that translate to blocking operations in
the hypervisor), which in can lead to large min_delta_ns rendering the
system unusable.

Replace the current heuristics with something more reliable. Allow the
interrupt code to try 3 times to catch up with the lost time. If that
fails use the total time spent in the interrupt handler to defer the
next timer interrupt so the system can catch up with other things
which got delayed. Limit that deferment to 100ms.

The retry events and the maximum time spent in the interrupt handler
are recorded and exposed via /proc/timer_list

Inspired by a patch from Marcelo.

Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: kvm@vger.kernel.org
2009-12-10 13:08:11 +01:00
Mike Galbraith
4ca3ef71f5 sched: Fix build warning in get_update_sysctl_factor()
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <new-submission>
2009-12-10 09:34:50 +01:00
Luck, Tony
ea5b41f9d5 lockdep: Avoid out of bounds array reference in save_trace()
ia64 found this the hard way (because we currently have a stub
for save_stack_trace() that does nothing). But it would be a
good idea to  be cautious in case a real save_stack_trace()
bailed out with an error before it set trace->nr_entries.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: luming.yu@intel.com
LKML-Reference: <4b2024d085302c2a2@agluck-desktop.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-10 08:29:33 +01:00
Ingo Molnar
788d70dce0 Merge branch 'tip/tracing/core3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/core 2009-12-10 08:18:41 +01:00
Xiao Guangrong
21140f4d33 perf_event: Fix perf_swevent_hrtimer() variable initialization
fix:

 [<c0477471>] ? printk+0x1d/0x24
 [<c01c98f9>] ? perf_prepare_sample+0x269/0x280
 [<c0149231>] warn_slowpath_common+0x71/0xd0
 [<c01c98f9>] ? perf_prepare_sample+0x269/0x280
 [<c01492aa>] warn_slowpath_null+0x1a/0x20
 [<c01c98f9>] perf_prepare_sample+0x269/0x280
 [<c016e9f3>] ? cpu_clock+0x53/0x90
 [<c01cc368>] __perf_event_overflow+0x2a8/0x300
 [<c01ccc3b>] perf_event_overflow+0x1b/0x30
 [<c01ccccf>] perf_swevent_hrtimer+0x7f/0x120

This is because 'data.raw' variable not initialize.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4B208E93.1010801@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-10 07:11:05 +01:00
Linus Torvalds
4ef58d4e2a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits)
  tree-wide: fix misspelling of "definition" in comments
  reiserfs: fix misspelling of "journaled"
  doc: Fix a typo in slub.txt.
  inotify: remove superfluous return code check
  hdlc: spelling fix in find_pvc() comment
  doc: fix regulator docs cut-and-pasteism
  mtd: Fix comment in Kconfig
  doc: Fix IRQ chip docs
  tree-wide: fix assorted typos all over the place
  drivers/ata/libata-sff.c: comment spelling fixes
  fix typos/grammos in Documentation/edac.txt
  sysctl: add missing comments
  fs/debugfs/inode.c: fix comment typos
  sgivwfb: Make use of ARRAY_SIZE.
  sky2: fix sky2_link_down copy/paste comment error
  tree-wide: fix typos "couter" -> "counter"
  tree-wide: fix typos "offest" -> "offset"
  fix kerneldoc for set_irq_msi()
  spidev: fix double "of of" in comment
  comment typo fix: sybsystem -> subsystem
  ...
2009-12-09 19:43:33 -08:00
Thomas Gleixner
86fc80f16e capabilities: Use RCU to protect task lookup in sys_capget
cap_get_target_pid() protects the task lookup with tasklist_lock.
security_capget() is called under tasklist_lock as well but
tasklist_lock does not protect anything there. The capabilities are
protected by RCU already.

So tasklist_lock only protects the lookup and prevents the task going
away, which can be done with rcu_read_lock() as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: James Morris <jmorris@namei.org>
2009-12-10 09:42:48 +11:00
Carsten Emde
f2942487ff tracing: Remove comparing of NULL to va_list in trace_array_vprintk()
Olof Johansson stated the following:

  Comparing a va_list with NULL is bogus. It's supposed to be treated like
  an opaque type and only be manipulated with va_* accessors.

Olof noticed that this code broke the ARM builds:

    kernel/trace/trace.c: In function 'trace_array_vprintk':
    kernel/trace/trace.c:1364: error: invalid operands to binary == (have 'va_list' and 'void *')
    kernel/trace/trace.c: In function 'tracing_mark_write':
    kernel/trace/trace.c:3349: error: incompatible type for argument 3 of 'trace_vprintk'

This patch partly reverts c13d2f7c32 and
re-installs the original mark_printk() mechanism.

Reported-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
LKML-Reference: <4B1BAB74.104@osadl.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-09 14:20:08 -05:00
Jiri Olsa
be1eca3931 tracing: Fix function graph trace_pipe to properly display failed entries
There is a case where the graph tracer might get confused and omits
displaying of a single record.  This applies mostly with the trace_pipe
since it is unlikely that the trace_seq buffer will overflow with the
trace file.

As the function_graph tracer goes through the trace entries keeping a
pointer to the current record:

current ->  func1 ENTRY
            func2 ENTRY
            func2 RETURN
            func1 RETURN

When an function ENTRY is encountered, it moves the pointer to the
next entry to check if the function is a nested or leaf function.

            func1 ENTRY
current ->  func2 ENTRY
            func2 RETURN
            func1 RETURN

If the rest of the writing of the function fills the trace_seq buffer,
then the trace_pipe read will ignore this entry. The next read will
Now start at the current location, but the first entry (func1) will
be discarded.

This patch keeps a copy of the current entry in the iterator private
storage and will keep track of when the trace_seq buffer fills. When
the trace_seq buffer fills, it will reuse the copy of the entry in the
next iteration.

[
  This patch has been largely modified by Steven Rostedt in order to
  clean it up and simplify it. The original idea and concept was from
  Jirka and for that, this patch will go under his name to give him
  the credit he deserves. But because this was modify by Steven Rostedt
  anything wrong with the patch should be blamed on Steven.
]

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1259067458-27143-1-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-09 14:09:06 -05:00
Johannes Berg
d184b31c0e tracing: Add full state to trace_seq
The trace_seq buffer might fill up, and right now one needs to check the
return value of each printf into the buffer to check for that.

Instead, have the buffer keep track of whether it is full or not, and
reject more input if it is full or would have overflowed with an input
that wasn't added.

Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-09 14:05:49 -05:00
Steven Rostedt
a63ce5b306 tracing: Buffer the output of seq_file in case of filled buffer
If the seq_read fills the buffer it will call s_start again on the next
itertation with the same position. This causes a problem with the
function_graph tracer because it consumes the iteration in order to
determine leaf functions.

What happens is that the iterator stores the entry, and the function
graph plugin will look at the next entry. If that next entry is a return
of the same function and task, then the function is a leaf and the
function_graph plugin calls ring_buffer_read which moves the ring buffer
iterator forward (the trace iterator still points to the function start
entry).

The copying of the trace_seq to the seq_file buffer will fail if the
seq_file buffer is full. The seq_read will not show this entry.
The next read by userspace will cause seq_read to again call s_start
which will reuse the trace iterator entry (the function start entry).
But the function return entry was already consumed. The function graph
plugin will think that this entry is a nested function and not a leaf.

To solve this, the trace code now checks the return status of the
seq_printf (trace_print_seq). If the writing to the seq_file buffer
fails, we set a flag in the iterator (leftover) and we do not reset
the trace_seq buffer. On the next call to s_start, we check the leftover
flag, and if it is set, we just reuse the trace_seq buffer and do not
call into the plugin print functions.

Before this patch:

 2)               |      fput() {
 2)               |        __fput() {
 2)   0.550 us    |          inotify_inode_queue_event();
 2)               |          __fsnotify_parent() {
 2)   0.540 us    |          inotify_dentry_parent_queue_event();

After the patch:

 2)               |      fput() {
 2)               |        __fput() {
 2)   0.550 us    |          inotify_inode_queue_event();
 2)   0.548 us    |          __fsnotify_parent();
 2)   0.540 us    |          inotify_dentry_parent_queue_event();

[
  Updated the patch to fix a missing return 0 from the trace_print_seq()
  stub when CONFIG_TRACING is disabled.

  Reported-by: Ingo Molnar <mingo@elte.hu>
]

Reported-by: Jiri Olsa <jolsa@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-09 13:55:26 -05:00
Steven Rostedt
29bf4a5e3f tracing: Only call pipe_close if pipe_close is defined
This fixes a cut and paste error that had pipe_close get called
if pipe_open was defined (not pipe_close).

Reported-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
LKML-Reference: <20091209153204.F4CD.A69D9226@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-09 12:47:35 -05:00
Linus Torvalds
3b8ecd2244 Merge branch 'bkl-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'bkl-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sys: Remove BKL from sys_reboot
  pm_qos: clean up racy global "name" variable
  pm_qos: remove BKL
2009-12-09 08:07:17 -08:00
Thomas Gleixner
f409adf5b1 futex: Protect pid lookup in compat code with RCU
find_task_by_vpid() in compat_sys_get_robust_list() does not require
tasklist_lock. It can be protected with rcu_read_lock as done in
sys_get_robust_list() already.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2009-12-09 14:22:14 +01:00
Frederic Weisbecker
822a696111 tracing/kprobes: Fix field creation's bad error handling
When we define the common event fields in kprobe, we invert the error
handling and return immediately in case of success. Then we omit
to define specific kprobes fields (ip and nargs), and specific
kretprobes fields (func, ret_ip, nargs). And we only define them
when we fail to create common fields.

The most visible consequence is that we can't create filter for
k(ret)probes specific fields.

This patch re-invert the success/error handling to fix it.

Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1260263815-5167-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:32:21 +01:00
Christian Ehrhardt
acb4a848da sched: Update normalized values on user updates via proc
The normalized values are also recalculated in case the scaling factor
changes.

This patch updates the internally used scheduler tuning values that are
normalized to one cpu in case a user sets new values via sysfs.

Together with patch 2 of this series this allows to let user configured
values scale (or not) to cpu add/remove events taking place later.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1259579808-11357-4-git-send-email-ehrhardt@linux.vnet.ibm.com>
[ v2: fix warning ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:04:02 +01:00
Christian Ehrhardt
1983a922a1 sched: Make tunable scaling style configurable
As scaling now takes place on all kind of cpu add/remove events a user
that configures values via proc should be able to configure if his set
values are still rescaled or kept whatever happens.

As the comments state that log2 was just a second guess that worked the
interface is not just designed for on/off, but to choose a scaling type.
Currently this allows none, log and linear, but more important it allwos
us to keep the interface even if someone has an even better idea how to
scale the values.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1259579808-11357-3-git-send-email-ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:04:01 +01:00
Christian Ehrhardt
0bcdcf28c9 sched: Fix missing sched tunable recalculation on cpu add/remove
Based on Peter Zijlstras patch suggestion this enables recalculation of
the scheduler tunables in response of a change in the number of cpus. It
also adds a max of eight cpus that are considered in that scaling.

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1259579808-11357-2-git-send-email-ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:03:58 +01:00
Peter Zijlstra
57785df5ac sched: Fix task priority bug
83f9ac removed a call to effective_prio() in wake_up_new_task(), which
leads to tasks running at MAX_PRIO.

This is caused by the idle thread being set to MAX_PRIO before forking
off init. O(1) used that to make sure idle was always preempted, CFS
uses check_preempt_curr_idle() for that so we can savely remove this bit
of legacy code.

Reported-by: Mike Galbraith <efault@gmx.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1259754383.4003.610.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:03:10 +01:00
Peter Zijlstra
cd8ad40de3 sched: cgroup: Implement different treatment for idle shares
When setting the weight for a per-cpu task-group, we have to put in a
phantom weight when there is no work on that cpu, otherwise we'll not
service that cpu when new work gets placed there until we again update
the per-cpu weights.

We used to add these phantom weights to the total, so that the idle
per-cpu shares don't get inflated, this however causes the non-idle
parts to get deflated, causing unexpected weight distibutions.

Reverse this, so that the non-idle shares are correct but the idle
shares are inflated.

Reported-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Tested-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1257934048.23203.76.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:03:09 +01:00
Peter Zijlstra
fb58bac5c7 sched: Remove unnecessary RCU exclusion
As Nick pointed out, and realized by myself when doing:
   sched: Fix balance vs hotplug race
the patch:
   sched: for_each_domain() vs RCU

is wrong, sched_domains are freed after synchronize_sched(), which
means disabling preemption is enough.

Reported-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:03:08 +01:00
Peter Zijlstra
6cecd084d0 sched: Discard some old bits
WAKEUP_RUNNING was an experiment, not sure why that ever ended up being
merged...

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:03:07 +01:00
Peter Zijlstra
3a7e73a2e2 sched: Clean up check_preempt_wakeup()
Streamline the wakeup preemption code a bit, unifying the preempt path
so that they all do the same.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:03:07 +01:00
Jupyung Lee
a65ac745e4 sched: Move update_curr() in check_preempt_wakeup() to avoid redundant call
If a RT task is woken up while a non-RT task is running,
check_preempt_wakeup() is called to check whether the new task can
preempt the old task. The function returns quickly without going deeper
because it is apparent that a RT task can always preempt a non-RT task.

In this situation, check_preempt_wakeup() always calls update_curr() to
update vruntime value of the currently running task. However, the
function call is unnecessary and redundant at that moment because (1) a
non-RT task can always be preempted by a RT task regardless of its
vruntime value, and (2) update_curr() will be called shortly when the
context switch between two occurs.

By moving update_curr() in check_preempt_wakeup(), we can avoid
redundant call to update_curr(), slightly reducing the time taken to
wake up RT tasks.

Signed-off-by: Jupyung Lee <jupyung@gmail.com>
[ Place update_curr() right before the wake_preempt_entity() call, which
  is the only thing that relies on the updated vruntime ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1258451500-6714-1-git-send-email-jupyung@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:03:06 +01:00
Peter Zijlstra
cd29fe6f26 sched: Sanitize fork() handling
Currently we try to do task placement in wake_up_new_task() after we do
the load-balance pass in sched_fork(). This yields complicated semantics
in that we have to deal with tasks on different RQs and the
set_task_cpu() calls in copy_process() and sched_fork()

Rename ->task_new() to ->task_fork() and call it from sched_fork()
before the balancing, this gives the policy a clear point to place the
task.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:03:05 +01:00
Peter Zijlstra
ab19cb2331 sched: Clean up ttwu() rq locking
Since set_task_clock() doesn't rely on rq->clock anymore we can simplyfy
the mess in ttwu().

Optimize things a bit by not fiddling with the IRQ state there.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:03:04 +01:00
Peter Zijlstra
5afcdab706 sched: Remove rq->clock coupling from set_task_cpu()
set_task_cpu() should be rq invariant and only touch task state, it
currently fails to do so, which opens up a few races, since not all
callers hold both rq->locks.

Remove the relyance on rq->clock, as any site calling set_task_cpu()
should also do a remote clock update, which should ensure the observed
time between these two cpus is monotonic, as per
kernel/sched_clock.c:sched_clock_remote().

Therefore we can simply remove the clock_offset bits and be happy.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:03:03 +01:00
Peter Zijlstra
970b13bacb sched: Consolidate select_task_rq() callers
Small cleanup.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
[ v2: build fix ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:03:02 +01:00
Peter Zijlstra
6b314d0e11 sched: Remove sysctl.sched_features
Since we've had a much saner debugfs interface to this, remove the
sysctl one.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
[ v2: build fix ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:03:01 +01:00
Thomas Gleixner
dba091b9e3 sched: Protect sched_rr_get_param() access to task->sched_class
sched_rr_get_param calls
task->sched_class->get_rr_interval(task) without protection
against a concurrent sched_setscheduler() call which modifies
task->sched_class.

Serialize the access with task_rq_lock(task) and hand the rq
pointer into get_rr_interval() as it's needed at least in the
sched_fair implementation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <alpine.LFD.2.00.0912090930120.3089@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:01:07 +01:00
Thomas Gleixner
3160568371 sched: Protect task->cpus_allowed access in sched_getaffinity()
sched_getaffinity() is not protected against a concurrent
modification of the tasks affinity.

Serialize the access with task_rq_lock(task).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091208202026.769251187@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 10:01:06 +01:00
Xiao Guangrong
ec89a06fd4 perf_event: Cleanup for cpu_clock_perf_event_update()
Using atomic64_xchg() instead of atomic64_read() and
atomic64_set().

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4B1F19DC.90204@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 09:56:27 +01:00
Xiao Guangrong
b93f7978ad perf_event: Allocate children's perf_event_ctxp at the right time
In current code, children task will allocate memory for
'child->perf_event_ctxp' if the parent is counted, we can
do it only if the parent allowed children inherit it.

It can save memory and reduce overhead.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4B1F19A8.5040805@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 09:56:27 +01:00
Xiao Guangrong
aa5452d70c perf_event: Clean up __perf_event_init_context()
Clean up the code a bit:

 - define 'perf_cpu_context' variable with 'static'
 - use kzalloc() instead of kmalloc() and memset()

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <4B1F194D.7080306@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 09:56:27 +01:00
Frederic Weisbecker
44234adcdc hw-breakpoints: Modify breakpoints without unregistering them
Currently, when ptrace needs to modify a breakpoint, like disabling
it, changing its address, type or len, it calls
modify_user_hw_breakpoint(). This latter will perform the heavy and
racy task of unregistering the old breakpoint and registering a new
one.

This is racy as someone else might steal the reserved breakpoint
slot under us, which is undesired as the breakpoint is only
supposed to be modified, sometimes in the middle of a debugging
workflow. We don't want our slot to be stolen in the middle.

So instead of unregistering/registering the breakpoint, just
disable it while we modify its breakpoint fields and re-enable it
after if necessary.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
LKML-Reference: <1260347148-5519-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-09 09:48:20 +01:00
Masami Hiramatsu
a7c312bed7 trace-kprobe: Support delete probe syntax
Support delete probe syntax. The syntax is "-:[group/]event".

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
LKML-Reference: <20091208220316.10142.39192.stgit@dhcp-100-2-132.bos.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
2009-12-09 07:26:53 +01:00
Linus Torvalds
2b876f95d0 Merge branches 'timers-for-linus-ntp' and 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus-ntp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  ntp: Provide compability defines (You say MOD_NANO, I say ADJ_NANO)

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: do not execute DEBUG_SHIRQ when irq setup failed
2009-12-08 19:30:19 -08:00
Linus Torvalds
fbf07eac7b Merge branch 'timers-for-linus-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  hrtimer: Fix /proc/timer_list regression
  itimers: Fix racy writes to cpu_itimer fields
  timekeeping: Fix clock_gettime vsyscall time warp
2009-12-08 19:28:09 -08:00
Linus Torvalds
60d8ce2cd6 Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  timers, init: Limit the number of per cpu calibration bootup messages
  posix-cpu-timers: optimize and document timer_create callback
  clockevents: Add missing include to pacify sparse
  x86: vmiclock: Fix printk format
  x86: Fix printk format due to variable type change
  sparc: fix printk for change of variable type
  clocksource/events: Fix fallout of generic code changes
  nohz: Allow 32-bit machines to sleep for more than 2.15 seconds
  nohz: Track last do_timer() cpu
  nohz: Prevent clocksource wrapping during idle
  nohz: Type cast printk argument
  mips: Use generic mult/shift factor calculation for clocks
  clocksource: Provide a generic mult/shift factor calculation
  clockevents: Use u32 for mult and shift factors
  nohz: Introduce arch_needs_cpu
  nohz: Reuse ktime in sub-functions of tick_check_idle.
  time: Remove xtime_cache
  time: Implement logarithmic time accumulation
2009-12-08 19:27:08 -08:00
Linus Torvalds
4d2a914239 Merge branch 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  core: Clean up user return notifers use of per_cpu
2009-12-08 13:24:20 -08:00
Linus Torvalds
6035ccd8e9 Merge branch 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits)
  cfq-iosched: Do not access cfqq after freeing it
  block: include linux/err.h to use ERR_PTR
  cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit
  blkio: Allow CFQ group IO scheduling even when CFQ is a module
  blkio: Implement dynamic io controlling policy registration
  blkio: Export some symbols from blkio as its user CFQ can be a module
  block: Fix io_context leak after failure of clone with CLONE_IO
  block: Fix io_context leak after clone with CLONE_IO
  cfq-iosched: make nonrot check logic consistent
  io controller: quick fix for blk-cgroup and modular CFQ
  cfq-iosched: move IO controller declerations to a header file
  cfq-iosched: fix compile problem with !CONFIG_CGROUP
  blkio: Documentation
  blkio: Wait on sync-noidle queue even if rq_noidle = 1
  blkio: Implement group_isolation tunable
  blkio: Determine async workload length based on total number of queues
  blkio: Wait for cfq queue to get backlogged if group is empty
  blkio: Propagate cgroup weight updation to cfq groups
  blkio: Drop the reference to queue once the task changes cgroup
  blkio: Provide some isolation between groups
  ...
2009-12-08 08:19:16 -08:00
Linus Torvalds
dad3de7d00 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM: Add flag for devices capable of generating run-time wake-up events
  PM / Runtime: Remove unnecessary braces in __pm_runtime_set_status()
  PM / Runtime: Make documentation of runtime_idle() agree with the code
  PM / Runtime: Ensure timer_expires is nonzero in pm_schedule_suspend()
  PM / Runtime: Use deferred_resume flag in pm_request_resume
  PM / Runtime: Export the PM runtime workqueue
  PM / Runtime: Fix lockdep warning in __pm_runtime_set_status()
  PM / Hibernate: Swap, use KERN_CONT
  PM / Hibernate: Shift remaining code from swsusp.c to hibernate.c
  PM / Hibernate: Move swap functions to kernel/power/swap.c.
  PM / freezer: Don't get over-anxious while waiting
2009-12-08 08:07:16 -08:00
Linus Torvalds
ed9216c171 Merge branch 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (84 commits)
  KVM: VMX: Fix comparison of guest efer with stale host value
  KVM: s390: Fix prefix register checking in arch/s390/kvm/sigp.c
  KVM: Drop user return notifier when disabling virtualization on a cpu
  KVM: VMX: Disable unrestricted guest when EPT disabled
  KVM: x86 emulator: limit instructions to 15 bytes
  KVM: s390: Make psw available on all exits, not just a subset
  KVM: x86: Add KVM_GET/SET_VCPU_EVENTS
  KVM: VMX: Report unexpected simultaneous exceptions as internal errors
  KVM: Allow internal errors reported to userspace to carry extra data
  KVM: Reorder IOCTLs in main kvm.h
  KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG
  KVM: only clear irq_source_id if irqchip is present
  KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic
  KVM: x86: disallow multiple KVM_CREATE_IRQCHIP
  KVM: VMX: Remove vmx->msr_offset_efer
  KVM: MMU: update invlpg handler comment
  KVM: VMX: move CR3/PDPTR update to vmx_set_cr3
  KVM: remove duplicated task_switch check
  KVM: powerpc: Fix BUILD_BUG_ON condition
  KVM: VMX: Use shared msr infrastructure
  ...

Trivial conflicts due to new Kconfig options in arch/Kconfig and kernel/Makefile
2009-12-08 08:02:38 -08:00
Linus Torvalds
d7fc02c7ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
  mac80211: fix reorder buffer release
  iwmc3200wifi: Enable wimax core through module parameter
  iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
  iwmc3200wifi: Coex table command does not expect a response
  iwmc3200wifi: Update wiwi priority table
  iwlwifi: driver version track kernel version
  iwlwifi: indicate uCode type when fail dump error/event log
  iwl3945: remove duplicated event logging code
  b43: fix two warnings
  ipw2100: fix rebooting hang with driver loaded
  cfg80211: indent regulatory messages with spaces
  iwmc3200wifi: fix NULL pointer dereference in pmkid update
  mac80211: Fix TX status reporting for injected data frames
  ath9k: enable 2GHz band only if the device supports it
  airo: Fix integer overflow warning
  rt2x00: Fix padding bug on L2PAD devices.
  WE: Fix set events not propagated
  b43legacy: avoid PPC fault during resume
  b43: avoid PPC fault during resume
  tcp: fix a timewait refcnt race
  ...

Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
CTL_UNNUMBERED removed) in
	kernel/sysctl_check.c
	net/ipv4/sysctl_net_ipv4.c
	net/ipv6/addrconf.c
	net/sctp/sysctl.c
2009-12-08 07:55:01 -08:00
Linus Torvalds
1557d33007 Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
  security/tomoyo: Remove now unnecessary handling of security_sysctl.
  security/tomoyo: Add a special case to handle accesses through the internal proc mount.
  sysctl: Drop & in front of every proc_handler.
  sysctl: Remove CTL_NONE and CTL_UNNUMBERED
  sysctl: kill dead ctl_handler definitions.
  sysctl: Remove the last of the generic binary sysctl support
  sysctl net: Remove unused binary sysctl code
  sysctl security/tomoyo: Don't look at ctl_name
  sysctl arm: Remove binary sysctl support
  sysctl x86: Remove dead binary sysctl support
  sysctl sh: Remove dead binary sysctl support
  sysctl powerpc: Remove dead binary sysctl support
  sysctl ia64: Remove dead binary sysctl support
  sysctl s390: Remove dead sysctl binary support
  sysctl frv: Remove dead binary sysctl support
  sysctl mips/lasat: Remove dead binary sysctl support
  sysctl drivers: Remove dead binary sysctl support
  sysctl crypto: Remove dead binary sysctl support
  sysctl security/keys: Remove dead binary sysctl support
  sysctl kernel: Remove binary sysctl logic
  ...
2009-12-08 07:38:50 -08:00
Andi Kleen
722d017237 futex: Take mmap_sem for get_user_pages in fault_in_user_writeable
get_user_pages() must be called with mmap_sem held.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: stable@kernel.org
Cc: Andrew Morton <akpm@linuxfoundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091208121942.GA21298@basil.fritz.box>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-12-08 14:59:36 +01:00
Stephen Rothwell
6ab8886326 perf: hw_breakpoints: Fix percpu namespace clash
Today's linux-next build failed with:

  kernel/hw_breakpoint.c:86: error: 'task_bp_pinned' redeclared as different kind of symbol
  ...

Caused by commit dd17c8f729 ("percpu:
remove per_cpu__ prefix") from the percpu tree interacting with
commit 56053170ea ("hw-breakpoints:
Fix task-bound breakpoint slot allocation") from the tip tree.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <20091208182515.bb6dda4a.sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-08 09:34:43 +01:00
Tejun Heo
50de1a8ef1 Merge branch 'for-linus' into for-next
Conflicts:
	mm/percpu.c
2009-12-08 10:02:12 +09:00
Jiri Kosina
d014d04386 Merge branch 'for-next' into for-linus
Conflicts:

	kernel/irq/chip.c
2009-12-07 18:36:35 +01:00
Steven Rostedt
c521efd170 tracing: Add pipe_close interface
An ftrace plugin can add a pipe_open interface when the user opens
trace_pipe. But if the plugin allocates something within the pipe_open
it can not free it because there exists no pipe_close. The hook to
the trace file open has a corresponding close. The closing of the
trace_pipe file should also have a corresponding close.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-12-07 12:01:35 -05:00
Frederic Weisbecker
56053170ea hw-breakpoints: Fix task-bound breakpoint slot allocation
Whatever the context nature of a breakpoint, we always perform the
following constraint checks before allocating it a slot:

- Check the number of pinned breakpoint bound the concerned cpus
- Check the max number of task-bound breakpoints that are belonging
  to a task.
- Add both and see if we have a reamining slot for the new breakpoint

This is the right thing to do when we are about to register a cpu-only
bound breakpoint. But not if we are dealing with a task bound
breakpoint. What we want in this case is:

- Check the number of pinned breakpoint bound the concerned cpus
- Check the number of breakpoints that already belong to the task
  in which the breakpoint to register is bound to.
- Add both

This fixes a regression that makes the "firefox -g" command fail to
register breakpoints once we deal with a secondary thread.

Reported-by: Walt <w41ter@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
2009-12-07 07:05:28 +01:00
Peter Zijlstra
6ad4c18884 sched: Fix balance vs hotplug race
Since (e761b77: cpu hotplug, sched: Introduce cpu_active_map and redo
sched domain managment) we have cpu_active_mask which is suppose to rule
scheduler migration and load-balancing, except it never (fully) did.

The particular problem being solved here is a crash in try_to_wake_up()
where select_task_rq() ends up selecting an offline cpu because
select_task_rq_fair() trusts the sched_domain tree to reflect the
current state of affairs, similarly select_task_rq_rt() trusts the
root_domain.

However, the sched_domains are updated from CPU_DEAD, which is after the
cpu is taken offline and after stop_machine is done. Therefore it can
race perfectly well with code assuming the domains are right.

Cure this by building the domains from cpu_active_mask on
CPU_DOWN_PREPARE.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-06 21:10:56 +01:00
Geert Uytterhoeven
e1b8090bdf cpumask: Fix generate_sched_domains() for UP
Commit acc3f5d7ca ("cpumask:
Partition_sched_domains takes array of cpumask_var_t") changed
the function signature of generate_sched_domains() for the
CONFIG_SMP=y case, but forgot to update the corresponding
function for the CONFIG_SMP=n case, causing:

  kernel/cpuset.c:2073: warning: passing argument 1 of 'generate_sched_domains' from incompatible pointer type

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <alpine.DEB.2.00.0912062038070.5693@ayla.of.borg>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-06 21:08:41 +01:00
Alan Stern
7b199ca202 PM / Runtime: Export the PM runtime workqueue
This patch (as1306) exports the PM runtime workqueue for use by
loadable modules.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-12-06 16:17:56 +01:00
Jiri Slaby
66d0ae4d6f PM / Hibernate: Swap, use KERN_CONT
Use KERN_CONT in save_image() for printks, so that anybody won't
try to add a loglevel.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-12-06 16:16:24 +01:00
Nigel Cunningham
8e60c6a134 PM / Hibernate: Shift remaining code from swsusp.c to hibernate.c
Shift the remaining declaration of the variable in_suspend and the
function swsusp_show_speed from swsusp.c to hibernate.c, and delete
swsusp.c.

Signed-off-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-12-06 16:16:07 +01:00
Nigel Cunningham
0414f2ec03 PM / Hibernate: Move swap functions to kernel/power/swap.c.
Move hibernation code's functions for allocating and freeing swap
from swsusp.c to swap.c, which is where you'd expect to find them.

Signed-off-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-12-06 16:15:53 +01:00
Rafael J. Wysocki
64357ed468 Merge branch 'master' into for-linus 2009-12-06 16:06:11 +01:00
Frank Rowand
109d71c6dd lockstat: Fix min, max times in /proc/lock_stats
Fix min, max times in /proc/lock_stats

(1) When collecting lock hold and wait times, if the current minimum
    time is zero, it will be replaced by the next time.

(2) When aggregating minimum and maximum lock hold and wait times
    accross cpus, the values are added, instead of selecting the
    minimum and maximum.

Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4B05BBAE.2050005@am.sony.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-06 13:20:00 +01:00
Frederic Weisbecker
b326e9560a hw-breakpoints: Use overflow handler instead of the event callback
struct perf_event::event callback was called when a breakpoint
triggers. But this is a rather opaque callback, pretty
tied-only to the breakpoint API and not really integrated into perf
as it triggers even when we don't overflow.

We prefer to use overflow_handler() as it fits into the perf events
rules, being called only when we overflow.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>
2009-12-06 08:27:18 +01:00
Frederic Weisbecker
2f0993e0fb hw-breakpoints: Drop callback and task parameters from modify helper
Drop the callback and task parameters from modify_user_hw_breakpoint().
For now we have no user that need to modify a breakpoint to the point
of changing its handler or its task context.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>
2009-12-06 08:27:17 +01:00
Linus Torvalds
897e81bea1 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
  sched, cputime: Introduce thread_group_times()
  sched, cputime: Cleanups related to task_times()
  Revert "sched, x86: Optimize branch hint in __switch_to()"
  sched: Fix isolcpus boot option
  sched: Revert 498657a478
  sched, time: Define nsecs_to_jiffies()
  sched: Remove task_{u,s,g}time()
  sched: Introduce task_times() to replace task_{u,s}time() pair
  sched: Limit the number of scheduler debug messages
  sched.c: Call debug_show_all_locks() when dumping all tasks
  sched, x86: Optimize branch hint in __switch_to()
  sched: Optimize branch hint in context_switch()
  sched: Optimize branch hint in pick_next_task_fair()
  sched_feat_write(): Update ppos instead of file->f_pos
  sched: Sched_rt_periodic_timer vs cpu hotplug
  sched, kvm: Fix race condition involving sched_in_preempt_notifers
  sched: More generic WAKE_AFFINE vs select_idle_sibling()
  sched: Cleanup select_task_rq_fair()
  sched: Fix granularity of task_u/stime()
  sched: Fix/add missing update_rq_clock() calls
  ...
2009-12-05 15:30:49 -08:00
Linus Torvalds
c3fa27d136 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (470 commits)
  x86: Fix comments of register/stack access functions
  perf tools: Replace %m with %a in sscanf
  hw-breakpoints: Keep track of user disabled breakpoints
  tracing/syscalls: Make syscall events print callbacks static
  tracing: Add DEFINE_EVENT(), DEFINE_SINGLE_EVENT() support to docbook
  perf: Don't free perf_mmap_data until work has been done
  perf_event: Fix compile error
  perf tools: Fix _GNU_SOURCE macro related strndup() build error
  trace_syscalls: Remove unused syscall_name_to_nr()
  trace_syscalls: Simplify syscall profile
  trace_syscalls: Remove duplicate init_enter_##sname()
  trace_syscalls: Add syscall_nr field to struct syscall_metadata
  trace_syscalls: Remove enter_id exit_id
  trace_syscalls: Set event_enter_##sname->data to its metadata
  trace_syscalls: Remove unused event_syscall_enter and event_syscall_exit
  perf_event: Initialize data.period in perf_swevent_hrtimer()
  perf probe: Simplify event naming
  perf probe: Add --list option for listing current probe events
  perf probe: Add argv_split() from lib/argv_split.c
  perf probe: Move probe event utility functions to probe-event.c
  ...
2009-12-05 15:30:21 -08:00
David S. Miller
28b4d5cc17 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	drivers/net/pcmcia/fmvj18x_cs.c
	drivers/net/pcmcia/nmclan_cs.c
	drivers/net/pcmcia/xirc2ps_cs.c
	drivers/net/wireless/ray_cs.c
2009-12-05 15:22:26 -08:00
Linus Torvalds
96fa2b508d Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (40 commits)
  tracing: Separate raw syscall from syscall tracer
  ring-buffer-benchmark: Add parameters to set produce/consumer priorities
  tracing, function tracer: Clean up strstrip() usage
  ring-buffer benchmark: Run producer/consumer threads at nice +19
  tracing: Remove the stale include/trace/power.h
  tracing: Only print objcopy version warning once from recordmcount
  tracing: Prevent build warning: 'ftrace_graph_buf' defined but not used
  ring-buffer: Move access to commit_page up into function used
  tracing: do not disable interrupts for trace_clock_local
  ring-buffer: Add multiple iterations between benchmark timestamps
  kprobes: Sanitize struct kretprobe_instance allocations
  tracing: Fix to use __always_unused attribute
  compiler: Introduce __always_unused
  tracing: Exit with error if a weak function is used in recordmcount.pl
  tracing: Move conditional into update_funcs() in recordmcount.pl
  tracing: Add regex for weak functions in recordmcount.pl
  tracing: Move mcount section search to front of loop in recordmcount.pl
  tracing: Fix objcopy revision check in recordmcount.pl
  tracing: Check absolute path of input file in recordmcount.pl
  tracing: Correct the check for number of arguments in recordmcount.pl
  ...
2009-12-05 09:53:36 -08:00
Linus Torvalds
7a797cdcca Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing: Fix trace_marker output
  tracing: Fix event format export
  tracing: Fix return value of tracing_stats_read()
2009-12-05 09:53:21 -08:00
Linus Torvalds
bb2166c898 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: Fix spurious irq seqfile conversion
  genirq: switch /proc/irq/*/spurious to seq_file
  irq: Do not attempt to create subdirectories if /proc/irq/<irq> failed
  irq: Remove unused debug_poll_all_shared_irqs()
  irq: Fix docbook comments
  irq: trivial: Fix typo in comment for #endif
2009-12-05 09:53:08 -08:00
Linus Torvalds
0bf7969fea Merge branch 'core-softlockup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-softlockup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  softlockup: Fix hung_task_check_count sysctl
2009-12-05 09:52:46 -08:00
Linus Torvalds
69f061e0c2 Merge branch 'core-signal-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-signal-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  signal: Print warning message when dropping signals
  signal: Fix alternate signal stack check
2009-12-05 09:52:33 -08:00
Linus Torvalds
607781762e Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
  rcu: Make RCU's CPU-stall detector be default
  rcu: Add expedited grace-period support for preemptible RCU
  rcu: Enable fourth level of TREE_RCU hierarchy
  rcu: Rename "quiet" functions
  rcu: Re-arrange code to reduce #ifdef pain
  rcu: Eliminate unneeded function wrapping
  rcu: Fix grace-period-stall bug on large systems with CPU hotplug
  rcu: Eliminate __rcu_pending() false positives
  rcu: Further cleanups of use of lastcomp
  rcu: Simplify association of forced quiescent states with grace periods
  rcu: Accelerate callback processing on CPUs not detecting GP end
  rcu: Mark init-time-only rcu_bootup_announce() as __init
  rcu: Simplify association of quiescent states with grace periods
  rcu: Rename dynticks_completed to completed_fqs
  rcu: Enable synchronize_sched_expedited() fastpath
  rcu: Remove inline from forward-referenced functions
  rcu: Fix note_new_gpnum() uses of ->gpnum
  rcu: Fix synchronization for rcu_process_gp_end() uses of ->completed counter
  rcu: Prepare for synchronization fixes: clean up for non-NO_HZ handling of ->completed counter
  rcu: Cleanup: balance rcu_irq_enter()/rcu_irq_exit() calls
  ...
2009-12-05 09:52:14 -08:00
Linus Torvalds
d0b093a8b5 Merge branch 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  ratelimit: Make suppressed output messages more useful
  printk: Remove ratelimit.h from kernel.h
  ratelimit: Fix/allow use in atomic contexts
  ratelimit: Use per ratelimit context locking
2009-12-05 09:50:22 -08:00
Linus Torvalds
3e72b810e3 Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  mutex: Fix missing conditions to build mutex_spin_on_owner()
  mutex: Better control mutex adaptive spinning config
  locking, task_struct: Reduce size on TRACE_IRQFLAGS and 64bit
  locking: Use __[SPIN|RW]_LOCK_UNLOCKED in [spin|rw]_lock_init()
  locking: Remove unused prototype
  locking: Reduce ifdefs in kernel/spinlock.c
  locking: Make inlining decision Kconfig based
2009-12-05 09:49:59 -08:00
Linus Torvalds
9b269d4034 Merge branch 'core-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  generic-ipi: Add smp_call_function_any()
  generic-ipi: Fix misleading smp_call_function*() description
2009-12-05 09:49:46 -08:00
Louis Rilling
b69f229206 block: Fix io_context leak after failure of clone with CLONE_IO
With CLONE_IO, parent's io_context->nr_tasks is incremented, but never
decremented whenever copy_process() fails afterwards, which prevents
exit_io_context() from calling IO schedulers exit functions.

Give a task_struct to exit_io_context(), and call exit_io_context() instead of
put_io_context() in copy_process() cleanup path.

Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-12-04 16:36:18 +01:00
André Goddard Rosa
af901ca181 tree-wide: fix assorted typos all over the place
That is "success", "unknown", "through", "performance", "[re|un]mapping"
, "access", "default", "reasonable", "[con]currently", "temperature"
, "channel", "[un]used", "application", "example","hierarchy", "therefore"
, "[over|under]flow", "contiguous", "threshold", "enough" and others.

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-12-04 15:39:55 +01:00
Uwe Kleine-König
fbfecd3712 tree-wide: fix typos "couter" -> "counter"
This patch was generated by

	git grep -E -i -l 'couter' | xargs -r perl -p -i -e 's/couter/counter/'

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-12-04 15:39:51 +01:00
Liuweni
fb3d38b990 fix kerneldoc for set_irq_msi()
Signed-off-by: Liuweni <qingshenlwy@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-12-04 15:39:50 +01:00
Patrick McHardy
8153a10c08 ipv4 05/05: add sysctl to accept packets with local source addresses
commit 8ec1e0ebe26087bfc5c0394ada5feb5758014fc8
Author: Patrick McHardy <kaber@trash.net>
Date:   Thu Dec 3 12:16:35 2009 +0100

    ipv4: add sysctl to accept packets with local source addresses

    Change fib_validate_source() to accept packets with a local source address when
    the "accept_local" sysctl is set for the incoming inet device. Combined with the
    previous patches, this allows to communicate between multiple local interfaces
    over the wire.

    Signed-off-by: Patrick McHardy <kaber@trash.net>

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-03 12:14:38 -08:00
Frederic Weisbecker
c08f782985 mutex: Fix missing conditions to build mutex_spin_on_owner()
We don't need to build mutex_spin_on_owner() if we have
CONFIG_DEBUG_MUTEXES or CONFIG_HAVE_DEFAULT_NO_SPIN_MUTEXES as
it won't be used under such configs.

Use CONFIG_MUTEX_SPIN_ON_OWNER as it gathers all the necessary
checks before building it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1259783357-8542-2-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
2009-12-03 11:50:11 +01:00
Frederic Weisbecker
c02260277e mutex: Better control mutex adaptive spinning config
Introduce CONFIG_MUTEX_SPIN_ON_OWNER so that we can centralize
in a single place the conditions that determine its definition
and use.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1259783357-8542-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
2009-12-03 11:50:11 +01:00
Paul E. McKenney
d9a3da0699 rcu: Add expedited grace-period support for preemptible RCU
Implement an synchronize_rcu_expedited() for preemptible RCU
that actually is expedited.  This uses
synchronize_sched_expedited() to force all threads currently
running in a preemptible-RCU read-side critical section onto the
appropriate ->blocked_tasks[] list, then takes a snapshot of all
of these lists and waits for them to drain.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1259784616158-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-03 11:35:25 +01:00
Paul E. McKenney
cf244dc01b rcu: Enable fourth level of TREE_RCU hierarchy
Enable a fourth level of rcu_node hierarchy for TREE_RCU and
TREE_PREEMPT_RCU.  This is for stress-testing and experiemental
purposes only, although in theory this would enable 16,777,216
CPUs on 64-bit systems, though only 1,048,576 CPUs on 32-bit
systems. Normal experimental use of this fourth level will
normally set CONFIG_RCU_FANOUT=2, requiring a 16-CPU system,
though the more adventurous (and more fortunate) experimenters
may wish to chose CONFIG_RCU_FANOUT=3 for 81-CPU systems or even
CONFIG_RCU_FANOUT=4 for 256-CPU systems.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12597846161257-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-03 11:34:53 +01:00
Paul E. McKenney
d3f6bad391 rcu: Rename "quiet" functions
The number of "quiet" functions has grown recently, and the
names are no longer very descriptive.  The point of all of these
functions is to do some portion of the task of reporting a
quiescent state, so rename them accordingly:

o	cpu_quiet() becomes rcu_report_qs_rdp(), which reports a
	quiescent state to the per-CPU rcu_data structure.  If this
	turns out to be a new quiescent state for this grace period,
	then rcu_report_qs_rnp() will be invoked to propagate the
	quiescent state up the rcu_node hierarchy.

o	cpu_quiet_msk() becomes rcu_report_qs_rnp(), which reports
	a quiescent state for a given CPU (or possibly a set of CPUs)
	up the rcu_node hierarchy.

o	cpu_quiet_msk_finish() becomes rcu_report_qs_rsp(), which
	reports a full set of quiescent states to the global rcu_state
	structure.

o	task_quiet() becomes rcu_report_unblock_qs_rnp(), which reports
	a quiescent state due to a task exiting an RCU read-side critical
	section that had previously blocked in that same critical section.
	As indicated by the new name, this type of quiescent state is
	reported up the rcu_node hierarchy (using rcu_report_qs_rnp()
	to do so).

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12597846163698-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-03 11:34:26 +01:00
Avi Kivity
58988b07cf Merge remote branch 'tip/x86/entry' into kvm-updates/2.6.33
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:30:06 +02:00
James Morris
c84d6efd36 Merge branch 'master' into next 2009-12-03 12:03:40 +05:30
Helge Deller
35dead4235 modules: don't export section names of empty sections via sysfs
On the parisc architecture we face for each and every loaded kernel module
this kernel "badness warning":
  sysfs: cannot create duplicate filename '/module/ac97_bus/sections/.text'
  Badness at fs/sysfs/dir.c:487

Reason for that is, that on parisc all kernel modules do have multiple
.text sections due to the usage of the -ffunction-sections compiler flag
which is needed to reach all jump targets on this platform.

An objdump on such a kernel module gives:
Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .note.gnu.build-id 00000024  00000000  00000000  00000034  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  1 .text         00000000  00000000  00000000  00000058  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  2 .text.ac97_bus_match 0000001c  00000000  00000000  00000058  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  3 .text         00000000  00000000  00000000  000000d4  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
...
Since the .text sections are empty (size of 0 bytes) and won't be
loaded by the kernel module loader anyway, I don't see a reason
why such sections need to be listed under
/sys/module/<module_name>/sections/<section_name> either.

The attached patch does solve this issue by not exporting section
names which are empty.

This fixes bugzilla http://bugzilla.kernel.org/show_bug.cgi?id=14703

Signed-off-by: Helge Deller <deller@gmx.de>
CC: rusty@rustcorp.com.au
CC: akpm@linux-foundation.org
CC: James.Bottomley@HansenPartnership.com
CC: roland@redhat.com
CC: dave@hiauly1.hia.nrc.ca
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-02 15:38:25 -08:00
Hidetoshi Seto
0cf55e1ec0 sched, cputime: Introduce thread_group_times()
This is a real fix for problem of utime/stime values decreasing
described in the thread:

   http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

 - {u,s}time in task_struct are increased every time when the thread
   is interrupted by a tick (timer interrupt).

 - When a thread exits, its {u,s}time are added to signal->{u,s}time,
   after adjusted by task_times().

 - When all threads in a thread_group exits, accumulated {u,s}time
   (and also c{u,s}time) in signal struct are added to c{u,s}time
   in signal struct of the group's parent.

So {u,s}time in task struct are "raw" tick count, while
{u,s}time and c{u,s}time in signal struct are "adjusted" values.

And accounted values are used by:

 - task_times(), to get cputime of a thread:
   This function returns adjusted values that originates from raw
   {u,s}time and scaled by sum_exec_runtime that accounted by CFS.

 - thread_group_cputime(), to get cputime of a thread group:
   This function returns sum of all {u,s}time of living threads in
   the group, plus {u,s}time in the signal struct that is sum of
   adjusted cputimes of all exited threads belonged to the group.

The problem is the return value of thread_group_cputime(),
because it is mixed sum of "raw" value and "adjusted" value:

  group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

This misbehavior can break {u,s}time monotonicity.
Assume that if there is a thread that have raw values greater
than adjusted values (e.g. interrupted by 1000Hz ticks 50 times
but only runs 45ms) and if it exits, cputime will decrease (e.g.
-5ms).

To fix this, we could do:

  group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

But task_times() contains hard divisions, so applying it for
every thread should be avoided.

This patch fixes the above problem in the following way:

 - Modify thread's exit (= __exit_signal()) not to use task_times().
   It means {u,s}time in signal struct accumulates raw values instead
   of adjusted values.  As the result it makes thread_group_cputime()
   to return pure sum of "raw" values.

 - Introduce a new function thread_group_times(*task, *utime, *stime)
   that converts "raw" values of thread_group_cputime() to "adjusted"
   values, in same calculation procedure as task_times().

 - Modify group's exit (= wait_task_zombie()) to use this introduced
   thread_group_times().  It make c{u,s}time in signal struct to
   have adjusted values like before this patch.

 - Replace some thread_group_cputime() by thread_group_times().
   This replacements are only applied where conveys the "adjusted"
   cputime to users, and where already uses task_times() near by it.
   (i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)

This patch have a positive side effect:

 - Before this patch, if a group contains many short-life threads
   (e.g. runs 0.9ms and not interrupted by ticks), the group's
   cputime could be invisible since thread's cputime was accumulated
   after adjusted: imagine adjustment function as adj(ticks, runtime),
     {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
   After this patch it will not happen because the adjustment is
   applied after accumulated.

v2:
 - remove if()s, put new variables into signal_struct.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-02 17:32:40 +01:00
Hidetoshi Seto
d99ca3b977 sched, cputime: Cleanups related to task_times()
- Remove if({u,s}t)s because no one call it with NULL now.
- Use cputime_{add,sub}().
- Add ifndef-endif for prev_{u,s}time since they are used
  only when !VIRT_CPU_ACCOUNTING.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <4B1624C7.7040302@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-02 17:32:39 +01:00
Rusty Russell
bdddd2963c sched: Fix isolcpus boot option
Anton Blanchard wrote:

> We allocate and zero cpu_isolated_map after the isolcpus
> __setup option has run. This means cpu_isolated_map always
> ends up empty and if CPUMASK_OFFSTACK is enabled we write to a
> cpumask that hasn't been allocated.

I introduced this regression in 49557e6203 (sched: Fix
boot crash by zalloc()ing most of the cpu masks).

Use the bootmem allocator if they set isolcpus=, otherwise
allocate and zero like normal.

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: peterz@infradead.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <200912021409.17013.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Anton Blanchard <anton@samba.org>
2009-12-02 10:27:16 +01:00
Avi Kivity
1786bf009f core: Clean up user return notifers use of per_cpu
Instead of using per_cpu(..., raw_smp_processor_id()), use
__get_cpu_var(...).

Signed-off-by: Avi Kivity <avi@redhat.com>
LKML-Reference: <1259578491-4589-1-git-send-email-avi@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-02 10:22:59 +01:00
Tejun Heo
8592e6486a sched: Revert 498657a478
498657a478 incorrectly assumed
that preempt wasn't disabled around context_switch() and thus
was fixing imaginary problem.  It also broke KVM because it
depended on ->sched_in() to be called with irq enabled so that
it can do smp calls from there.

Revert the incorrect commit and add comment describing different
contexts under with the two callbacks are invoked.

Avi: spotted transposed in/out in the added comment.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Avi Kivity <avi@redhat.com>
Cc: peterz@infradead.org
Cc: efault@gmx.de
Cc: rusty@rustcorp.com.au
LKML-Reference: <1259726212-30259-2-git-send-email-tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-02 09:55:33 +01:00
Randy Dunlap
595dd3d8bf kmsg_dump: fix build for CONFIG_PRINTK=n
kmsg_dump() fails to build when CONFIG_PRINTK=n; provide stubs
for the kmsg_dump*() functions when CONFIG_PRINTK=n.

kernel/printk.c: In function 'kmsg_dump':
kernel/printk.c:1501: error: 'log_buf_len' undeclared (first use in this function)
kernel/printk.c:1502: error: 'logged_chars' undeclared (first use in this function)
kernel/printk.c:1506: error: 'log_buf' undeclared (first use in this function)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Simon Kagstrom <simon.kagstrom@netinsight.net>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-12-02 08:44:33 +00:00
Kristian Høgsberg
ec70ccd806 perf: Don't free perf_mmap_data until work has been done
In the CONFIG_PERF_USE_VMALLOC case, perf_mmap_data_free() only
schedules the cleanup of the perf_mmap_data struct.  In that
case we have to wait until the work has been done before we free
data.

Signed-off-by: Kristian Høgsberg <krh@bitplanet.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@kernel.org>
LKML-Reference: <1259697901-1747-1-git-send-email-krh@bitplanet.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-02 09:30:18 +01:00
David S. Miller
ff9c38bba3 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/mac80211/ht.c
2009-12-01 22:13:38 -08:00
Lai Jiangshan
7be077f563 trace_syscalls: Remove unused syscall_name_to_nr()
After duplications are removed, syscall_name_to_nr() is unused.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D2A6.6060803@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-01 17:33:31 +01:00
Lai Jiangshan
3bbe84e9d3 trace_syscalls: Simplify syscall profile
use only one prof_sysenter_enable() instead of
prof_sysenter_enable_##sname()

use only one prof_sysenter_disable() instead of
prof_sysenter_disable_##sname()

use only one prof_sysexit_enable() instead of
prof_sysexit_enable_##sname()

use only one prof_sysexit_disable() instead of
prof_sysexit_disable_##sname()

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D2A1.8060304@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-01 17:33:30 +01:00
Lai Jiangshan
a1301da099 trace_syscalls: Remove duplicate init_enter_##sname()
use only one init_syscall_trace instead of
many init_enter_##sname()/init_exit_##sname()

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D29B.6090708@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-01 17:33:30 +01:00
Lai Jiangshan
c252f65793 trace_syscalls: Add syscall_nr field to struct syscall_metadata
Add syscall_nr field to struct syscall_metadata,
it helps us to get syscall number easier.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D293.6090800@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-01 17:33:29 +01:00
Lai Jiangshan
fcc19438dd trace_syscalls: Remove enter_id exit_id
use ->enter_event->id instead of ->enter_id
use ->exit_event->id instead of ->exit_id

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D288.7030001@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-01 17:33:29 +01:00
Lai Jiangshan
31c16b1334 trace_syscalls: Set event_enter_##sname->data to its metadata
Set event_enter_##sname->data to its metadata,
it makes codes simpler.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D282.7050709@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-01 17:33:28 +01:00
Lai Jiangshan
bf56a4ea9f trace_syscalls: Remove unused event_syscall_enter and event_syscall_exit
fix event_enter_##sname->event
fix event_exit_##sname->event

remove unused event_syscall_enter and event_syscall_exit

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B14D278.4090209@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-01 17:33:28 +01:00
David Howells
f13a48bd79 SLOW_WORK: Move slow_work's proc file to debugfs
Move slow_work's debugging proc file to debugfs.

Signed-off-by: David Howells <dhowells@redhat.com>
Requested-and-acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-01 08:20:31 -08:00
David Howells
fa1dae4906 SLOW_WORK: Fix the CONFIG_MODULES=n case
Commits 3d7a641 ("SLOW_WORK: Wait for outstanding work items belonging to a
module to clear") introduced some code to make sure that all of a module's
slow-work items were complete before that module was removed, and commit
3bde31a ("SLOW_WORK: Allow a requeueable work item to sleep till the thread is
needed") further extended that, breaking it in the process if CONFIG_MODULES=n:

    CC      kernel/slow-work.o
  kernel/slow-work.c: In function 'slow_work_execute':
  kernel/slow-work.c:313: error: 'slow_work_thread_processing' undeclared (first use in this function)
  kernel/slow-work.c:313: error: (Each undeclared identifier is reported only once
  kernel/slow-work.c:313: error: for each function it appears in.)
  kernel/slow-work.c: In function 'slow_work_wait_for_items':
  kernel/slow-work.c:950: error: 'slow_work_unreg_sync_lock' undeclared (first use in this function)
  kernel/slow-work.c:951: error: 'slow_work_unreg_wq' undeclared (first use in this function)
  kernel/slow-work.c:961: error: 'slow_work_unreg_work_item' undeclared (first use in this function)
  kernel/slow-work.c:974: error: 'slow_work_unreg_module' undeclared (first use in this function)
  kernel/slow-work.c:977: error: 'slow_work_thread_processing' undeclared (first use in this function)
  make[1]: *** [kernel/slow-work.o] Error 1

Fix this by:

 (1) Extracting the bits of slow_work_execute() that are contingent on
     CONFIG_MODULES, and the bits that should be, into inline functions and
     placing them into the #ifdef'd section that defines the relevant variables
     and adding stubs for moduleless kernels.  This allows the removal of some
     #ifdefs.

 (2) #ifdef'ing out the contents of slow_work_wait_for_items() in moduleless
     kernels.

The four functions related to handling module unloading synchronisation (and
their associated variables) could be offloaded into a separate .c file, but
each function is only used once and three of them are tiny, so doing so would
prevent them from being inlined.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-01 07:35:11 -08:00
Xiao Guangrong
59d069eb5a perf_event: Initialize data.period in perf_swevent_hrtimer()
In current code in perf_swevent_hrtimer(), data.period is not
initialized, The result is obvious wrong:

 # ./perf record -f -e cpu-clock make
 # ./perf report
 # Samples: 1740
 #
 # Overhead   Command                                   ......
 # ........  ........  ..........................................
 #
   1025422183050275328.00%        sh  libc-2.9.90.so ...
   1025422183050275328.00%      perl  libperl.so     ...
   1025422168240043264.00%      perl  [kernel]       ...
   1025422030011210752.00%      perl  [kernel]       ...

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@kernel.org>
LKML-Reference: <4B14E220.2050107@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-01 11:19:07 +01:00
Masami Hiramatsu
ba8665d7dd trace_kprobes: Fix a memory leak bug and check kstrdup() return value
Fix a memory leak case in create_trace_probe(). When an argument
is too long (> MAX_ARGSTR_LEN), it just jumps to error path. In
that case tp->args[i].name is not released.
This also fixes a bug to check kstrdup()'s return value.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: systemtap <systemtap@sources.redhat.com>
Cc: DLE <dle-develop@lists.sourceforge.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091201001919.10235.56455.stgit@harusame>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-01 08:19:59 +01:00
Simon Kagstrom
456b565cc5 core: Add kernel message dumper to call on oopses and panics
The core functionality is implemented as per Linus suggestion from

  http://lists.infradead.org/pipermail/linux-mtd/2009-October/027620.html

(with the kmsg_dump implementation by Linus). A struct kmsg_dumper has
been added which contains a callback to dump the kernel log buffers on
crashes. The kmsg_dump function gets called from oops_exit() and panic()
and invokes this callbacks with the crash reason.

[dwmw2: Fix log_end handling]
Signed-off-by: Simon Kagstrom <simon.kagstrom@netinsight.net>
Reviewed-by: Anders Grafstrom <anders.grafstrom@netinsight.net>
Reviewed-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-11-30 12:01:49 +00:00
Avi Kivity
8e7cac7980 core: Fix user return notifier on fork()
fork() clones all thread_info flags, including
TIF_USER_RETURN_NOTIFY; if the new task is first scheduled on a cpu
which doesn't have user return notifiers set, this causes user
return notifiers to trigger without any way of clearing itself.

This is easy to trigger with a forky workload on the host in
parallel with kvm, resulting in a cpu in an endless loop on the
verge of returning to userspace.

Fix by dropping the TIF_USER_RETURN_NOTIFY immediately after fork.

Signed-off-by: Avi Kivity <avi@redhat.com>
LKML-Reference: <1259505288-16559-1-git-send-email-avi@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-29 22:03:04 +01:00
Lai Jiangshan
52a11f3549 trace_kprobes: Don't output zero offset
"symbol_name+0" is not so friendly.
It makes the output longer.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B0CEBCB.7080309@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-27 06:43:05 +01:00
Lai Jiangshan
3d9b2e1ddf trace_kprobes: Always show group name
Sometimes the group name is not "kprobes",
It'll be better if we can read it from tracing/kprobe_events.

 # echo 'r:laijs/vfs_read vfs_read %ax' > kprobe_events
 # cat kprobe_events
 r:laijs/vfs_read vfs_read %ax=%ax

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B0CEBAF.6000104@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-27 06:43:04 +01:00
Lai Jiangshan
abab9d37d2 trace_kprobes: Fix memory leak
tp->nr_args is not set before we "goto error",
it causes memory leak for free_trace_probe() use tp->nr_args
to free memory of args.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4B0CEB95.2060107@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-27 06:43:04 +01:00
Lai Jiangshan
0f1ef51d24 trace_syscalls: Add syscall nr field
Field syscall number is missed in syscall_enter_define_fields()/
syscall_exit_define_fields().

Syscall number is also needed for event filter or other users.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <4B0E330D.1070206@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-27 06:24:19 +01:00