linux

Author	SHA1	Message	Date
Ingo Molnar	e02c4fd314	Merge branch 'tip/tracing/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent	2010-03-04 11:51:29 +01:00
Ingo Molnar	4f16d4e0c9	Merge branch 'perf/core' into perf/urgent Merge reason: Switch from pre-merge topical split to the post-merge urgent track Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-04 11:47:52 +01:00
Paul E. McKenney	db1466b3e1	rcu: Use wrapper function instead of exporting tasklist_lock Lockdep-RCU commit `d11c563d` exported tasklist_lock, which is not a good thing. This patch instead exports a function that uses lockdep to check whether tasklist_lock is held. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: Christoph Hellwig <hch@lst.de> LKML-Reference: <1267631219-8713-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-04 11:46:14 +01:00
Ingo Molnar	0e064caf64	Merge branches 'core/futexes' and 'core/iommu' into core/urgent Merge reason: Switch from topical split to the stabilization track Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-04 11:45:31 +01:00
Steffen Klassert	7478138782	padata: Allocate the cpumask for the padata instance The cpumask of the padata instance was used without allocated. This caused boot crashes if CONFIG_CPUMASK_OFFSTACK is enabled. This patch fixes this by doing proper allocation for this cpumask. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-03-04 13:30:22 +08:00
Linus Torvalds	a27341cd5f	Prioritize synchronous signals over 'normal' signals This makes sure that we pick the synchronous signals caused by a processor fault over any pending regular asynchronous signals sent to use by [t]kill(). This is not strictly required semantics, but it makes it _much_ easier for programs like Wine that expect to find the fault information in the signal stack. Without this, if a non-synchronous signal gets picked first, the delayed asynchronous signal will have its signal context pointing to the new signal invocation, rather than the instruction that caused the SIGSEGV or SIGBUS in the first place. This is not all that pretty, and we're discussing making the synchronous signals more explicit rather than have these kinds of implicit preferences of SIGSEGV and friends. See for example http://bugzilla.kernel.org/show_bug.cgi?id=15395 for some of the discussion. But in the meantime this is a simple and fairly straightforward work-around, and the whole if (x & Y) x &= Y; thing can be compiled into (and gcc does do it) just three instructions: movq %rdx, %rax andl $Y, %eax cmovne %rax, %rdx so it is at least a simple solution to a subtle issue. Reported-and-tested-by: Pavel Vilim <wylda@volny.cz> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-03 19:21:10 -08:00
Al Viro	9643f5d94a	Merge branch 'for-fsnotify' into for-linus	2010-03-03 17:12:40 -05:00
Al Viro	1f707137b5	new helper: iterate_mounts() apply function to vfsmounts in set returned by collect_mounts(), stop if it returns non-zero. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-03-03 14:07:57 -05:00
Al Viro	2096f759ab	New helper: path_is_under(path1, path2) Analog of is_subdir for vfsmount,dentry pairs, moved from audit_tree.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-03-03 14:07:55 -05:00
Al Viro	8737c9305b	Switch may_open() and break_lease() to passing O_... ... instead of mixing FMODE_ and O_ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-03-03 13:00:21 -05:00
Linus Torvalds	2a32f2db13	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: resource: Fix broken indentation resource: Fix generic page_is_ram() for partial RAM pages x86, paravirt: Remove kmap_atomic_pte paravirt op. x86, vmi: Disable highmem PTE allocation even when CONFIG_HIGHPTE=y x86, xen: Disable highmem PTE allocation even when CONFIG_HIGHPTE=y	2010-03-03 09:11:02 -08:00
Linus Torvalds	fb7b096d94	Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits) x86: Fix out of order of gsi x86: apic: Fix mismerge, add arch_probe_nr_irqs() again x86, irq: Keep chip_data in create_irq_nr and destroy_irq xen: Remove unnecessary arch specific xen irq functions. smp: Use nr_cpus= to set nr_cpu_ids early x86, irq: Remove arch_probe_nr_irqs sparseirq: Use radix_tree instead of ptrs array sparseirq: Change irq_desc_ptrs to static init: Move radix_tree_init() early irq: Remove unnecessary bootmem code x86: Add iMac9,1 to pci_reboot_dmi_table x86: Convert i8259_lock to raw_spinlock x86: Convert nmi_lock to raw_spinlock x86: Convert ioapic_lock and vector_lock to raw_spinlock x86: Avoid race condition in pci_enable_msix() x86: Fix SCI on IOAPIC != 0 x86, ia32_aout: do not kill argument mapping x86, irq: Move __setup_vector_irq() before the first irq enable in cpu online path x86, irq: Update the vector domain for legacy irqs handled by io-apic x86, irq: Don't block IRQ0_VECTOR..IRQ15_VECTOR's on all cpu's ...	2010-03-03 08:15:37 -08:00
Linus Torvalds	a626b46e17	Merge branch 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits) early_res: Need to save the allocation name in drop_range_partial() sparsemem: Fix compilation on PowerPC early_res: Add free_early_partial() x86: Fix non-bootmem compilation on PowerPC core: Move early_res from arch/x86 to kernel/ x86: Add find_fw_memmap_area Move round_up/down to kernel.h x86: Make 32bit support NO_BOOTMEM early_res: Enhance check_and_double_early_res x86: Move back find_e820_area to e820.c x86: Add find_early_area_size x86: Separate early_res related code from e820.c x86: Move bios page reserve early to head32/64.c sparsemem: Put mem map for one node together. sparsemem: Put usemap for one node together x86: Make 64 bit use early_res instead of bootmem before slab x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA x86: Make early_node_mem get mem > 4 GB if possible x86: Dynamically increase early_res array size x86: Introduce max_early_res and early_res_count ...	2010-03-03 08:15:05 -08:00
Linus Torvalds	0a135ba14d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: add __percpu sparse annotations to what's left percpu: add __percpu sparse annotations to fs percpu: add __percpu sparse annotations to core kernel subsystems local_t: Remove leftover local.h this_cpu: Remove pageset_notifier this_cpu: Page allocator conversion percpu, x86: Generic inc / dec percpu instructions local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c module: Use this_cpu_xx to dynamically allocate counters local_t: Remove cpu_local_xx macros percpu: refactor the code in pcpu_[de]populate_chunk() percpu: remove compile warnings caused by __verify_pcpu_ptr() percpu: make accessors check for percpu pointer in sparse percpu: add __percpu for sparse. percpu: make access macros universal percpu: remove per_cpu__ prefix.	2010-03-03 07:34:18 -08:00
Denys Vlasenko	3d9a854c2d	Rename .data[.percpu][.XXX] to .data[..percpu][..XXX]. Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> Signed-off-by: Michal Marek <mmarek@suse.cz>	2010-03-03 11:26:00 +01:00
Lai Jiangshan	ac91d85456	tracing: Fix warning in s_next of trace file ops This warning in s_next() can be triggered by lseek(): [<c018b3f7>] ? s_next+0x77/0x80 [<c013e3c1>] warn_slowpath_common+0x81/0xa0 [<c018b3f7>] ? s_next+0x77/0x80 [<c013e3fa>] warn_slowpath_null+0x1a/0x20 [<c018b3f7>] s_next+0x77/0x80 [<c01efa77>] traverse+0x117/0x200 [<c01eff13>] seq_lseek+0xa3/0x120 [<c01efe70>] ? seq_lseek+0x0/0x120 [<c01d7081>] vfs_llseek+0x41/0x50 [<c01d8116>] sys_llseek+0x66/0xa0 [<c0102bd0>] sysenter_do_call+0x12/0x26 The iterator "leftover" variable is zeroed in the opening of the trace file. But lseek can call s_start() which will call s_next() without reseting the "leftover" variable back to zero, which might trigger the WARN_ON_ONCE(iter->leftover) that is in s_next(). Cc: stable@kernel.org Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B8CE06A.9090207@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-02 21:11:47 -05:00
Linus Torvalds	832d30ca72	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (38 commits) SELinux: Make selinux_kernel_create_files_as() shouldn't just always return 0 TOMOYO: Protect find_task_by_vpid() with RCU. Security: add static to security_ops and default_security_ops variable selinux: libsepol: remove dead code in check_avtab_hierarchy_callback() TOMOYO: Remove __func__ from tomoyo_is_correct_path/domain security: fix a couple of sparse warnings TOMOYO: Remove unneeded parameter. TOMOYO: Use shorter names. TOMOYO: Use enum for index numbers. TOMOYO: Add garbage collector. TOMOYO: Add refcounter on domain structure. TOMOYO: Merge headers. TOMOYO: Add refcounter on string data. TOMOYO: Reduce lines by using common path for addition and deletion. selinux: fix memory leak in sel_make_bools TOMOYO: Extract bitfield syslog: clean up needless comment syslog: use defined constants instead of raw numbers syslog: distinguish between /proc/kmsg and syscalls selinux: allow MLS->non-MLS and vice versa upon policy reload ...	2010-03-02 14:47:24 -08:00
H. Peter Anvin	f41496607e	resource: Fix broken indentation Fix broken indentation in patch `37b99dd537`. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Wu Fengguang <fengguang.wu@intel.com> LKML-Reference: <20100301135551.GA9998@localhost> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-03-02 11:21:09 -08:00
Linus Torvalds	b7f3a209e9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6: sparc: Support show_unhandled_signals. sparc: use __ratelimit sunxvr500: Additional PCI id for sunxvr500 driver sparc: use asm-generic/scatterlist.h sparc64: If 'slot-names' property exist, create sysfs PCI slot information. sparc: remove trailing space in messages sparc: remove redundant return statements	2010-03-02 07:56:44 -08:00
Linus Torvalds	6d6b89bd2e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1341 commits) virtio_net: remove forgotten assignment be2net: fix tx completion polling sis190: fix cable detect via link status poll net: fix protocol sk_buff field bridge: Fix build error when IGMP_SNOOPING is not enabled bnx2x: Tx barriers and locks scm: Only support SCM_RIGHTS on unix domain sockets. vhost-net: restart tx poll on sk_sndbuf full vhost: fix get_user_pages_fast error handling vhost: initialize log eventfd context pointer vhost: logging thinko fix wireless: convert to use netdev_for_each_mc_addr ethtool: do not set some flags, if others failed ipoib: returned back addrlen check for mc addresses netlink: Adding inode field to /proc/net/netlink axnet_cs: add new id bridge: Make IGMP snooping depend upon BRIDGE. bridge: Add multicast count/interval sysfs entries bridge: Add hash elasticity/max sysfs entries bridge: Add multicast_snooping sysfs toggle ... Trivial conflicts in Documentation/feature-removal-schedule.txt	2010-03-02 07:55:08 -08:00
Peter Zijlstra	320ebf09cb	perf, x86: Restrict the ANY flag The ANY flag can show SMT data of another task (like 'top'), so we want to disable it when system-wide profiling is disabled. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-02 15:06:46 +01:00
Peter Zijlstra	5671a10e2b	nmi_watchdog: Tell the world we're active Because I was wondering why perf stat wasn't working as expected.. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Don Zickus <dzickus@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-02 15:02:05 +01:00
john stultz	ad6759fbf3	timekeeping: Prevent oops when GENERIC_TIME=n Aaro Koskinen reported an issue in kernel.org bugzilla #15366, where on non-GENERIC_TIME systems, accessing /sys/devices/system/clocksource/clocksource0/current_clocksource results in an oops. It seems the timekeeper/clocksource rework missed initializing the curr_clocksource value in the !GENERIC_TIME case. Thanks to Aaro for reporting and diagnosing the issue as well as testing the fix! Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: stable@kernel.org LKML-Reference: <1267475683.4216.61.camel@localhost.localdomain> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-02 09:22:25 +01:00
Yinghai Lu	dce46a04d5	early_res: Need to save the allocation name in drop_range_partial() During free_early_partial(), reserve_early_without_check() could end extending the early_res area from __check_and_double_early_res(); as a result, the location of the name for the current reservation could change. Therefore, we need to save a local copy of the name. [ hpa: rewrote comment and checkin description ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4B8C7C94.7070000@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-03-01 23:23:02 -08:00
Wu Fengguang	37b99dd537	resource: Fix generic page_is_ram() for partial RAM pages The System RAM walk shall skip partial RAM pages and avoid calling func() on them. So that page_is_ram() return 0 for a partial RAM page. In particular, it shall not call func() with len=0. This fixes a boot time bug reported by Sachin and root caused by Thomas: > >>> WARNING: at arch/x86/mm/ioremap.c:111 __ioremap_caller+0x169/0x2f1() > >>> Hardware name: BladeCenter LS21 -[79716AA]- > >>> Modules linked in: > >>> Pid: 0, comm: swapper Not tainted 2.6.33-git6-autotest #1 > >>> Call Trace: > >>> [<ffffffff81047cff>] ? __ioremap_caller+0x169/0x2f1 > >>> [<ffffffff81063b7d>] warn_slowpath_common+0x77/0xa4 > >>> [<ffffffff81063bb9>] warn_slowpath_null+0xf/0x11 > >>> [<ffffffff81047cff>] __ioremap_caller+0x169/0x2f1 > >>> [<ffffffff813747a3>] ? acpi_os_map_memory+0x12/0x1b > >>> [<ffffffff81047f10>] ioremap_nocache+0x12/0x14 > >>> [<ffffffff813747a3>] acpi_os_map_memory+0x12/0x1b > >>> [<ffffffff81282fa0>] acpi_tb_verify_table+0x29/0x5b > >>> [<ffffffff812827f0>] acpi_load_tables+0x39/0x15a > >>> [<ffffffff8191c8f8>] acpi_early_init+0x60/0xf5 > >>> [<ffffffff818f2cad>] start_kernel+0x397/0x3a7 > >>> [<ffffffff818f2295>] x86_64_start_reservations+0xa5/0xa9 > >>> [<ffffffff818f237a>] x86_64_start_kernel+0xe1/0xe8 > >>> ---[ end trace 4eaa2a86a8e2da22 ]--- > >>> ioremap reserve_memtype failed -22 The return code is -EINVAL, so it failed in the is_ram check, which is not too surprising > BIOS-provided physical RAM map: > BIOS-e820: 0000000000000000 - 000000000009c000 (usable) > BIOS-e820: 000000000009c000 - 00000000000a0000 (reserved) > BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) > BIOS-e820: 0000000000100000 - 00000000cffa3900 (usable) > BIOS-e820: 00000000cffa3900 - 00000000cffa7400 (ACPI data) The ACPI data is not starting on a page boundary and neither does the usable RAM area end on a page boundary. Very useful ! > ACPI: DSDT 00000000cffa3900 036CE (v01 IBM SERLEWIS 00001000 INTL 20060912) ACPI is trying to map DSDT at cffa3900, which results in a check vs. cffa3000 which is the relevant page boundary. The generic is_ram check correctly identifies that as RAM because it's in the usable resource area. The old e820 based is_ram check does not take overlapping resource areas into account. That's why it works. CC: Sachin Sant <sachinp@in.ibm.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> LKML-Reference: <20100301135551.GA9998@localhost> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-03-01 10:18:32 -08:00
Linus Torvalds	b1bf936840	Merge branch 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block: (38 commits) block: don't access jiffies when initialising io_context cfq: remove 8 bytes of padding from cfq_rb_root on 64 bit builds block: fix for "Consolidate phys_segment and hw_segment limits" cfq-iosched: quantum check tweak blktrace: perform cleanup after setup error blkdev: fix merge_bvec_fn return value checks cfq-iosched: requests "in flight" vs "in driver" clarification cciss: Fix problem with scatter gather elements in the scsi half of the driver cciss: eliminate unnecessary pointer use in cciss scsi code cciss: do not use void pointer for scsi hba data cciss: factor out scatter gather chain block mapping code cciss: fix scatter gather chain block dma direction kludge cciss: simplify scatter gather code cciss: factor out scatter gather chain block allocation and freeing cciss: detect bad alignment of scsi commands at build time cciss: clarify command list padding calculation cfq-iosched: rethink seeky detection for SSDs cfq-iosched: rework seeky detection block: remove padding from io_context on 64bit builds block: Consolidate phys_segment and hw_segment limits ...	2010-03-01 09:00:29 -08:00
Linus Torvalds	0f45339794	Merge branches 'futexes-for-linus', 'irq-core-for-linus' and 'bkl-drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Protect pid lookup in compat code with RCU * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Fix documentation of default chip disable() * 'bkl-drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: nvram: Drop the BKL from nvram_open()	2010-03-01 08:51:52 -08:00
Linus Torvalds	e56425b135	Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix-timers.c: Don't export local functions clocksource: start CMT at clocksource resume clocksource: add suspend callback clocksource: add argument to resume callback ntp: Cleanup xtime references in ntp.c ntp: Make time_esterror and time_maxerror static	2010-03-01 08:48:25 -08:00
Paul E. McKenney	90a6501f94	sched, rcu: Fix rcu_dereference() for RCU-lockdep Make rcu_dereference() of runqueue data structures be rcu_dereference_sched(). Located-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20100228163218.GD6846@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-01 09:29:58 +01:00
Ingo Molnar	e2f4699ac1	Merge branch 'linus' into core/rcu Merge reason: Backmerge latest upstream to queue up dependent fix in the scheduler. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-01 09:28:58 +01:00
David S. Miller	4b17764737	sparc: Support show_unhandled_signals. Just faults right now, will add other traps later. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-03-01 00:02:23 -08:00
David S. Miller	47871889c6	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ Conflicts: drivers/firmware/iscsi_ibft.c	2010-02-28 19:23:06 -08:00
James Morris	b4ccebdd37	Merge branch 'next' into for-linus	2010-03-01 09:36:31 +11:00
Frederic Weisbecker	1e259e0a99	hw-breakpoints: Remove stub unthrottle callback We support event unthrottling in breakpoint events. It means that if we have more than sysctl_perf_event_sample_rate/HZ, perf will throttle, ignoring subsequent events until the next tick. So if ptrace exceeds this max rate, it will omit events, which breaks the ptrace determinism that is supposed to report every triggered breakpoints. This is likely to happen if we set sysctl_perf_event_sample_rate to 1. This patch removes support for unthrottling in breakpoint events to break throttling and restore ptrace determinism. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: 2.6.33.x <stable@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org>	2010-02-28 20:51:15 +01:00
Linus Torvalds	6f5621cb16	Merge branch 'x86-ptrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-ptrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, ptrace: Remove set_stopped_child_used_math() in [x]fpregs_set x86, ptrace: Simplify xstateregs_get() ptrace: Fix ptrace_regset() comments and diagnose errors specifically parisc: Disable CONFIG_HAVE_ARCH_TRACEHOOK ptrace: Add support for generic PTRACE_GETREGSET/PTRACE_SETREGSET x86, ptrace: regset extensions to support xstate	2010-02-28 10:59:44 -08:00
Dmitry Monakhov	9a8c28c831	blktrace: perform cleanup after setup error Currently even if BLKTRACESETUP ioctl has failed user must call BLKTRACETEARDOWN to be shure what all staff was cleaned, which is contr-intuitive. Let's setup ioctl make necessery cleanup by it self. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-28 19:47:19 +01:00
Frederic Weisbecker	ae1f30384b	tracing: Include irqflags headers from trace clock trace_clock.c includes spinlock.h, which ends up including asm/system.h, which in turn includes linux/irqflags.h in x86. So the definition of raw_local_irq_save is luckily covered there, but this is not the case in parisc: tip/kernel/trace/trace_clock.c:86: error: implicit declaration of function 'raw_local_irq_save' tip/kernel/trace/trace_clock.c:112: error: implicit declaration of function 'raw_local_irq_restore' We need to include linux/irqflags.h directly from trace_clock.c to avoid such build error. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Robert Richter <robert.richter@amd.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-28 19:45:01 +01:00
Linus Torvalds	46bbffad54	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, mm: Unify kernel_physical_mapping_init() API x86, mm: Allow highmem user page tables to be disabled at boot time x86: Do not reserve brk for DMI if it's not going to be used x86: Convert tlbstate_lock to raw_spinlock x86: Use the generic page_is_ram() x86: Remove BIOS data range from e820 Move page_is_ram() declaration to mm.h Generic page_is_ram: use __weak resources: introduce generic page_is_ram()	2010-02-28 10:38:45 -08:00
Linus Torvalds	f66ffdedbf	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits) sched: Fix SCHED_MC regression caused by change in sched cpu_power sched: Don't use possibly stale sched_class kthread, sched: Remove reference to kthread_create_on_cpu sched: cpuacct: Use bigger percpu counter batch values for stats counters percpu_counter: Make __percpu_counter_add an inline function on UP sched: Remove member rt_se from struct rt_rq sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu] sched: Remove unused update_shares_locked() sched: Use for_each_bit sched: Queue a deboosted task to the head of the RT prio queue sched: Implement head queueing for sched_rt sched: Extend enqueue_task to allow head queueing sched: Remove USER_SCHED sched: Fix the place where group powers are updated sched: Assume *balance is valid sched: Remove load_balance_newidle() sched: Unify load_balance{,_newidle}() sched: Add a lock break for PREEMPT=y sched: Remove from fwd decls sched: Remove rq_iterator from move_one_task ... Fix up trivial conflicts in kernel/sched.c	2010-02-28 10:31:01 -08:00
Linus Torvalds	2531216f23	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix race between ttwu() and task_rq_lock() sched: Fix SMT scheduler regression in find_busiest_queue() sched: Fix sched_mv_power_savings for !SMT kernel/sched.c: Suppress unused var warning	2010-02-28 10:23:41 -08:00
Linus Torvalds	6556a67435	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (172 commits) perf_event, amd: Fix spinlock initialization perf_event: Fix preempt warning in perf_clock() perf tools: Flush maps on COMM events perf_events, x86: Split PMU definitions into separate files perf annotate: Handle samples not at objdump output addr boundaries perf_events, x86: Remove superflous MSR writes perf_events: Simplify code by removing cpu argument to hw_perf_group_sched_in() perf_events, x86: AMD event scheduling perf_events: Add new start/stop PMU callbacks perf_events: Report the MMAP pgoff value in bytes perf annotate: Defer allocating sym_priv->hist array perf symbols: Improve debugging information about symtab origins perf top: Use a macro instead of a constant variable perf symbols: Check the right return variable perf/scripts: Tag syscall_name helper as not yet available perf/scripts: Add perf-trace-python Documentation perf/scripts: Remove unnecessary PyTuple resizes perf/scripts: Add syscall tracing scripts perf/scripts: Add Python scripting engine perf/scripts: Remove check-perf-trace from listed scripts ... Fix trivial conflict in tools/perf/util/probe-event.c	2010-02-28 10:20:25 -08:00
Linus Torvalds	e0d272429a	Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (28 commits) ftrace: Add function names to dangling } in function graph tracer tracing: Simplify memory recycle of trace_define_field tracing: Remove unnecessary variable in print_graph_return tracing: Fix typo of info text in trace_kprobe.c tracing: Fix typo in prof_sysexit_enable() tracing: Remove CONFIG_TRACE_POWER from kernel config tracing: Fix ftrace_event_call alignment for use with gcc 4.5 ftrace: Remove memory barriers from NMI code when not needed tracing/kprobes: Add short documentation for HAVE_REGS_AND_STACK_ACCESS_API s390: Add pt_regs register and stack access API tracing/kprobes: Make Kconfig dependencies generic tracing: Unify arch_syscall_addr() implementations tracing: Add notrace to TRACE_EVENT implementation functions ftrace: Allow to remove a single function from function graph filter tracing: Add correct/incorrect to sort keys for branch annotation output tracing: Simplify test for function_graph tracing start point tracing: Drop the tr check from the graph tracing path tracing: Add stack dump to trace_printk if stacktrace option is set tracing: Use appropriate perl constructs in recordmcount.pl tracing: optimize recordmcount.pl for offsets-handling ...	2010-02-28 10:17:55 -08:00
Linus Torvalds	642c4c75a7	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (44 commits) rcu: Fix accelerated GPs for last non-dynticked CPU rcu: Make non-RCU_PROVE_LOCKING rcu_read_lock_sched_held() understand boot rcu: Fix accelerated grace periods for last non-dynticked CPU rcu: Export rcu_scheduler_active rcu: Make rcu_read_lock_sched_held() take boot time into account rcu: Make lockdep_rcu_dereference() message less alarmist sched, cgroups: Fix module export rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information rcu: Fix rcutorture mod_timer argument to delay one jiffy rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection rcu: Convert to raw_spinlocks rcu: Stop overflowing signed integers rcu: Use canonical URL for Mathieu's dissertation rcu: Accelerate grace period if last non-dynticked CPU rcu: Fix citation of Mathieu's dissertation rcu: Documentation update for CONFIG_PROVE_RCU security: Apply lockdep-based checking to rcu_dereference() uses idr: Apply lockdep-based diagnostics to rcu_dereference() uses radix-tree: Disable RCU lockdep checking in radix tree vfs: Abstract rcu_dereference_check for files-fdtable use ...	2010-02-28 10:13:16 -08:00
Linus Torvalds	f91b22c35f	Merge branches 'core-ipi-for-linus', 'core-locking-for-linus', 'tracing-fixes-for-linus', 'x86-debug-for-linus', 'x86-doc-for-linus', 'x86-gpu-for-linus' and 'x86-rlimit-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: generic-ipi: Optimize accesses by using DEFINE_PER_CPU_SHARED_ALIGNED for IPI data * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: plist: Fix grammar mistake, and c-style mistake * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: kprobes: Add mcount to the kprobes blacklist * 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86_64: Print modules like i386 does * 'x86-doc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Put 'nopat' in kernel-parameters * 'x86-gpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86-64: Allow fbdev primary video code * 'x86-rlimit-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Use helpers for rlimits	2010-02-28 10:04:02 -08:00
Paul E. McKenney	622ea685f1	rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU Make the holdoff only happen when the full number of attempts have been made. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267311188-16603-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-28 09:17:42 +01:00
Tejun Heo	44ee63587d	percpu: Add __percpu sparse annotations to hw_breakpoint Add __percpu sparse annotations to hw_breakpoint. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. In kernel/hw_breakpoint.c, per_cpu(nr_task_bp_pinned, cpu)'s will trigger spurious noderef related warnings from sparse. Changing it to &per_cpu(nr_task_bp_pinned[0], cpu) will work around the problem but deemed to ugly by the maintainer. Leave it alone until better solution can be found. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: K.Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <4B7B4B7A.9050902@kernel.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-02-27 16:23:39 +01:00
Frederic Weisbecker	018cbffe68	Merge commit 'v2.6.33' into perf/core Merge reason: __percpu annotations need the corresponding sparse address space definition upstream. Conflicts: tools/perf/util/probe-event.c (trivial)	2010-02-27 16:18:46 +01:00
Ingo Molnar	480917427b	Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/core	2010-02-27 10:41:16 +01:00
Ingo Molnar	6fb83029db	Merge branch 'tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into tracing/core	2010-02-27 10:06:10 +01:00
Paul E. McKenney	71da81324c	rcu: Fix accelerated GPs for last non-dynticked CPU This patch disables irqs across the call to rcu_needs_cpu(). It also enforces a hold-off period so that the idle loop doesn't softirq itself to death when there are lots of RCU callbacks in flight on the last non-dynticked CPU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267231138-27856-3-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-27 09:53:53 +01:00
Paul E. McKenney	a47cd880b5	rcu: Fix accelerated grace periods for last non-dynticked CPU It is invalid to invoke __rcu_process_callbacks() with irqs disabled, so do it indirectly via raise_softirq(). This requires a state-machine implementation to cycle through the grace-period machinery the required number of times. Located-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267231138-27856-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-27 09:53:52 +01:00
Linus Torvalds	06a79b82b2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM / Hibernate: Fix preallocating of memory PM / Hibernate: Remove swsusp.c finally PM / Hibernate: Remove trailing space in message PM: Allow SCSI devices to suspend/resume asynchronously PM: Allow USB devices to suspend/resume asynchronously USB: implement non-tree resume ordering constraints for PCI host controllers PM: Allow PCI devices to suspend/resume asynchronously PM / Hibernate: Swap, remove useless check from swsusp_read() PM / Hibernate: Really deprecate deprecated user ioctls PM: Allow device drivers to use dpm_wait() PM: Start asynchronous resume threads upfront PM: Add facility for advanced testing of async suspend/resume PM: Add a switch for disabling/enabling asynchronous suspend/resume PM: Asynchronous suspend and resume of devices PM: Add parent information to timing messages PM: Document device power attributes in sysfs PM / Runtime: Add sysfs switch for disabling device run-time PM	2010-02-26 17:22:53 -08:00
Linus Torvalds	37d4008484	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (31 commits) crypto: aes_generic - Fix checkpatch errors crypto: fcrypt - Fix checkpatch errors crypto: ecb - Fix checkpatch errors crypto: des_generic - Fix checkpatch errors crypto: deflate - Fix checkpatch errors crypto: crypto_null - Fix checkpatch errors crypto: cipher - Fix checkpatch errors crypto: crc32 - Fix checkpatch errors crypto: compress - Fix checkpatch errors crypto: cast6 - Fix checkpatch errors crypto: cast5 - Fix checkpatch errors crypto: camellia - Fix checkpatch errors crypto: authenc - Fix checkpatch errors crypto: api - Fix checkpatch errors crypto: anubis - Fix checkpatch errors crypto: algapi - Fix checkpatch errors crypto: blowfish - Fix checkpatch errors crypto: aead - Fix checkpatch errors crypto: ablkcipher - Fix checkpatch errors crypto: pcrypt - call the complete function on error ...	2010-02-26 16:50:02 -08:00
Steven Rostedt	f1c7f517a5	ftrace: Add function names to dangling } in function graph tracer The function graph tracer is currently the most invasive tracer in the ftrace family. It can easily overflow the buffer even with 10megs per CPU. This means that events can often be lost. On start up, or after events are lost, if the function return is recorded but the function enter was lost, all we get to see is the exiting '}'. Here is how a typical trace output starts: [tracing] cat trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # \| \| \| \| \| \| \| 0) + 91.897 us \| } 0) ! 567.961 us \| } 0) <========== \| 0) ! 579.083 us \| _raw_spin_lock_irqsave(); 0) 4.694 us \| _raw_spin_unlock_irqrestore(); 0) ! 594.862 us \| } 0) ! 603.361 us \| } 0) ! 613.574 us \| } 0) ! 623.554 us \| } 0) 3.653 us \| fget_light(); 0) \| sock_poll() { There are a series of '}' with no matching "func() {". There's no information to what functions these ending brackets belong to. This patch adds a stack on the per cpu structure used in outputting the function graph tracer to keep track of what function was outputted. Then on a function exit event, it checks the depth to see if the function exit has a matching entry event. If it does, then it only prints the '}', otherwise it adds the function name after the '}'. This allows function exit events to show what function they belong to at trace output startup, when the entry was lost due to ring buffer overflow, or even after a new task is scheduled in. Here is what the above trace will look like after this patch: [tracing] cat trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # \| \| \| \| \| \| \| 0) + 91.897 us \| } (irq_exit) 0) ! 567.961 us \| } (smp_apic_timer_interrupt) 0) <========== \| 0) ! 579.083 us \| _raw_spin_lock_irqsave(); 0) 4.694 us \| _raw_spin_unlock_irqrestore(); 0) ! 594.862 us \| } (add_wait_queue) 0) ! 603.361 us \| } (__pollwait) 0) ! 613.574 us \| } (tcp_poll) 0) ! 623.554 us \| } (sock_poll) 0) 3.653 us \| fget_light(); 0) \| sock_poll() { Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-26 19:25:53 -05:00
Rafael J. Wysocki	a9c9b4429d	PM / Hibernate: Fix preallocating of memory The hibernate memory preallocation code allocates memory to push some user space data out of physical RAM, so that the hibernation image is not too large. It allocates more memory than necessary for creating the image, so it has to release some pages to make room for allocations made while suspending devices and disabling nonboot CPUs, or the system will hang due to the lack of free pages to allocate from. Unfortunately, the function used for freeing these pages, free_unnecessary_pages(), contains a bug that prevents it from doing the job on all systems without highmem. Fix this problem, which is a regression from the 2.6.30 kernel, by using the right condition for the termination of the loop in free_unnecessary_pages(). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-and-tested-by: Alan Jenkins <sourcejedi.lkml@googlemail.com> Cc: stable@kernel.org	2010-02-26 20:39:13 +01:00
Jiri Slaby	f8bb0db818	PM / Hibernate: Remove swsusp.c finally Its contents and entry in Makefile were already removed in `8e60c6a134` (Shift remaining code from swsusp.c to hibernate.c) but somehow it remained in-place (rjw: which most likely was my mistake). Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Nigel Cunningham <nigel@tuxonice.net> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-02-26 20:39:13 +01:00
Frans Pop	07c3bb5797	PM / Hibernate: Remove trailing space in message Remove a trailing space from a message in swsusp_save(). Signed-off-by: Frans Pop <elendil@planet.nl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-02-26 20:39:13 +01:00
Jiri Slaby	09c09bc618	PM / Hibernate: Swap, remove useless check from swsusp_read() It will never reach here if the sws_resume_bdev is erratic. swsusp_read() is called only from software_resume(), but after swsusp_check() which would catch the error state. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-02-26 20:39:11 +01:00
Jiri Slaby	b694e52ebd	PM / Hibernate: Really deprecate deprecated user ioctls They were deprecated and removed from exported headers more than 2 years ago. Inform users about their removal in the future now. (Switch cases needed to be reorderded for an easy fall through.) And add an entry to feature-removal-schedule. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-02-26 20:39:11 +01:00
Rafael J. Wysocki	5a2eb8585f	PM: Add facility for advanced testing of async suspend/resume Add configuration switch CONFIG_PM_ADVANCED_DEBUG for compiling in extra PM debugging/testing code allowing one to access some PM-related attributes of devices from the user space via sysfs. If CONFIG_PM_ADVANCED_DEBUG is set, add sysfs attribute power/async for every device allowing the user space to access the device's power.async_suspend flag and modify it, if desired. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-02-26 20:39:10 +01:00
Rafael J. Wysocki	0e06b4a891	PM: Add a switch for disabling/enabling asynchronous suspend/resume Add sysfs attribute /sys/power/pm_async allowing the user space to disable/enable asynchronous suspend/resume of devices. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-02-26 20:39:10 +01:00
Linus Torvalds	68c6b85984	Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (48 commits) x86/PCI: Prevent mmconfig memory corruption ACPI: Use GPE reference counting to support shared GPEs x86/PCI: use host bridge _CRS info by default on 2008 and newer machines PCI: augment bus resource table with a list PCI: add pci_bus_for_each_resource(), remove direct bus->resource[] refs PCI: read bridge windows before filling in subtractive decode resources PCI: split up pci_read_bridge_bases() PCIe PME: use pci_pcie_cap() PCI PM: Run-time callbacks for PCI bus type PCIe PME: use pci_is_pcie() PCI / ACPI / PM: Platform support for PCI PME wake-up ACPI / ACPICA: Multiple system notify handlers per device ACPI / PM: Add more run-time wake-up fields ACPI: Use GPE reference counting to support shared GPEs PCI PM: Make it possible to force using INTx for PCIe PME signaling PCI PM: PCIe PME root port service driver PCI PM: Add function for checking PME status of devices PCI: mark is_pcie obsolete PCI: set PCI_PREF_RANGE_TYPE_64 in pci_bridge_check_ranges PCI: pciehp: second try to get big range for pcie devices ...	2010-02-26 10:35:27 -08:00
Peter Zijlstra	24691ea964	perf_event: Fix preempt warning in perf_clock() A recent commit introduced a preemption warning for perf_clock(), use raw_smp_processor_id() to avoid this, it really doesn't matter which cpu we use here. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1267198583.22519.684.camel@laptop> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 17:25:00 +01:00
Suresh Siddha	dd5feea14a	sched: Fix SCHED_MC regression caused by change in sched cpu_power On platforms like dual socket quad-core platform, the scheduler load balancer is not detecting the load imbalances in certain scenarios. This is leading to scenarios like where one socket is completely busy (with all the 4 cores running with 4 tasks) and leaving another socket completely idle. This causes performance issues as those 4 tasks share the memory controller, last-level cache bandwidth etc. Also we won't be taking advantage of turbo-mode as much as we would like, etc. Some of the comparisons in the scheduler load balancing code are comparing the "weighted cpu load that is scaled wrt sched_group's cpu_power" with the "weighted average load per task that is not scaled wrt sched_group's cpu_power". While this has probably been broken for a longer time (for multi socket numa nodes etc), the problem got aggrevated via this recent change: \| \| commit `f93e65c186` \| Author: Peter Zijlstra <a.p.zijlstra@chello.nl> \| Date: Tue Sep 1 10:34:32 2009 +0200 \| \| sched: Restore __cpu_power to a straight sum of power \| Also with this change, the sched group cpu power alone no longer reflects the group capacity that is needed to implement MC, MT performance (default) and power-savings (user-selectable) policies. We need to use the computed group capacity (sgs.group_capacity, that is computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to find out if the group with the max load is above its capacity and how much load to move etc. Reported-by: Ma Ling <ling.ma@intel.com> Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> [ -v2: build fix ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> # [2.6.32.x, 2.6.33.x] LKML-Reference: <1266970432.11588.22.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 15:45:13 +01:00
Peter Zijlstra	6e37738a2f	perf_events: Simplify code by removing cpu argument to hw_perf_group_sched_in() Since the cpu argument to hw_perf_group_sched_in() is always smp_processor_id(), simplify the code a little by removing this argument and using the current cpu where needed. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Miller <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1265890918.5396.3.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 10:56:53 +01:00
Stephane Eranian	38331f62c2	perf_events, x86: AMD event scheduling This patch adds correct AMD NorthBridge event scheduling. NB events are events measuring L3 cache, Hypertransport traffic. They are identified by an event code >= 0xe0. They measure events on the Northbride which is shared by all cores on a package. NB events are counted on a shared set of counters. When a NB event is programmed in a counter, the data actually comes from a shared counter. Thus, access to those counters needs to be synchronized. We implement the synchronization such that no two cores can be measuring NB events using the same counters. Thus, we maintain a per-NB allocation table. The available slot is propagated using the event_constraint structure. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4b703957.0702d00a.6bf2.7b7d@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 10:56:53 +01:00
Stephane Eranian	d76a0812ac	perf_events: Add new start/stop PMU callbacks In certain situations, the kernel may need to stop and start the same event rapidly. The current PMU callbacks do not distinguish between stop and release (i.e., stop + free the resource). Thus, a counter may be released, then it will be immediately re-acquired. Event scheduling will again take place with no guarantee to assign the same counter. On some processors, this may event yield to failure to assign the event back due to competion between cores. This patch is adding a new pair of callback to stop and restart a counter without actually release the underlying counter resource. On stop, the counter is stopped, its values saved and that's it. On start, the value is reloaded and counter is restarted (on x86, actual restart is delayed until perf_enable()). Signed-off-by: Stephane Eranian <eranian@google.com> [ added fallback to ->enable/->disable for all other PMUs fixed x86_pmu_start() to call x86_pmu.enable() merged __x86_pmu_disable into x86_pmu_stop() ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4b703875.0a04d00a.7896.ffffb824@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 10:56:53 +01:00
Peter Zijlstra	3a0304e90a	perf_events: Report the MMAP pgoff value in bytes DaveM reported that currently perf interprets the pgoff value reported by the MMAP events as a byte range, but the kernel reports it as a page offset. Since its broken (and unusable) anyway, change the kernel behaviour (ABI) to report bytes indeed, avoiding the need for userspace to deal with PAGE_SIZE things. Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 10:56:52 +01:00
Ingo Molnar	281b3714e9	Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/core	2010-02-26 09:20:17 +01:00
Ingo Molnar	64b9fb5704	Merge commit 'v2.6.33' into tracing/core Conflicts: scripts/recordmcount.pl Merge reason: Merge up to v2.6.33. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 09:18:32 +01:00
Yinghai Lu	fb90ef93df	early_res: Add free_early_partial() To free partial areas in pcpu_setup... Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> LKML-Reference: <4B85E245.5030001@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 08:25:35 +01:00
Paul E. McKenney	f5f6540964	rcu: Export rcu_scheduler_active Kernel modules using rcu_read_lock_sched_held() must now have access to rcu_scheduler_active, so it must be exported. This should fix the fix for the boot-time RCU-lockdep splat. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20100226030230.GA7743@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 08:20:46 +01:00
Paul E. McKenney	d9f1bb6ad7	rcu: Make rcu_read_lock_sched_held() take boot time into account Before the scheduler starts, all tasks are non-preemptible by definition. So, during that time, rcu_read_lock_sched_held() needs to always return "true". This patch makes that be so. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267135607-7056-2-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 08:20:46 +01:00
Paul E. McKenney	056ba4a9be	rcu: Make lockdep_rcu_dereference() message less alarmist Change from "unsafe" to "suspicious", given that there will be false alarms. Suggested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267135607-7056-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 08:20:46 +01:00
Masami Hiramatsu	b2be84df99	kprobes: Jump optimization sysctl interface Add /proc/sys/debug/kprobes-optimization sysctl which enables and disables kprobes jump optimization on the fly for debugging. Changes in v7: - Remove ctl_name = CTL_UNNUMBERED for upstream compatibility. Changes in v6: - Update comments and coding style. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Anders Kaseorg <andersk@ksplice.com> Cc: Tim Abbott <tabbott@ksplice.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> LKML-Reference: <20100225133415.6725.8274.stgit@localhost6.localdomain6> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 17:49:25 +01:00
Masami Hiramatsu	afd66255b9	kprobes: Introduce kprobes jump optimization Introduce kprobes jump optimization arch-independent parts. Kprobes uses breakpoint instruction for interrupting execution flow, on some architectures, it can be replaced by a jump instruction and interruption emulation code. This gains kprobs' performance drastically. To enable this feature, set CONFIG_OPTPROBES=y (default y if the arch supports OPTPROBE). Changes in v9: - Fix a bug to optimize probe when enabling. - Check nearby probes can be optimize/unoptimize when disarming/arming kprobes, instead of registering/unregistering. This will help kprobe-tracer because most of probes on it are usually disabled. Changes in v6: - Cleanup coding style for readability. - Add comments around get/put_online_cpus(). Changes in v5: - Use get_online_cpus()/put_online_cpus() for avoiding text_mutex deadlock. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Anders Kaseorg <andersk@ksplice.com> Cc: Tim Abbott <tabbott@ksplice.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> LKML-Reference: <20100225133407.6725.81992.stgit@localhost6.localdomain6> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 17:49:24 +01:00
Masami Hiramatsu	4610ee1d36	kprobes: Introduce generic insn_slot framework Make insn_slot framework support various size slots. Current insn_slot just supports one-size instruction buffer slot. However, kprobes jump optimization needs larger size buffers. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Anders Kaseorg <andersk@ksplice.com> Cc: Tim Abbott <tabbott@ksplice.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> LKML-Reference: <20100225133358.6725.82430.stgit@localhost6.localdomain6> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Anders Kaseorg <andersk@ksplice.com> Cc: Tim Abbott <tabbott@ksplice.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>	2010-02-25 17:49:24 +01:00
Wenji Huang	7b60997f73	tracing: Simplify memory recycle of trace_define_field Discard freeing field->type since it is not necessary. Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Wenji Huang <wenji.huang@oracle.com> LKML-Reference: <1266997226-6833-5-git-send-email-wenji.huang@oracle.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-25 10:42:55 -05:00
Wenji Huang	c85f3a91f8	tracing: Remove unnecessary variable in print_graph_return The "cpu" variable is declared at the start of the function and also within a branch, with the exact same initialization. Remove the local variable of the same name in the branch. Signed-off-by: Wenji Huang <wenji.huang@oracle.com> LKML-Reference: <1266997226-6833-3-git-send-email-wenji.huang@oracle.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-25 10:41:24 -05:00
Wenji Huang	a5efd92511	tracing: Fix typo of info text in trace_kprobe.c Signed-off-by: Wenji Huang <wenji.huang@oracle.com> LKML-Reference: <1266997226-6833-2-git-send-email-wenji.huang@oracle.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-25 10:36:29 -05:00
Wenji Huang	6574658b3b	tracing: Fix typo in prof_sysexit_enable() Signed-off-by: Wenji Huang <wenji.huang@oracle.com> LKML-Reference: <1266997226-6833-1-git-send-email-wenji.huang@oracle.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-25 10:35:55 -05:00
Li Zefan	1ab83a8941	tracing: Remove CONFIG_TRACE_POWER from kernel config The power tracer has been converted to power trace events. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B84D50E.4070806@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-25 10:31:45 -05:00
Jeff Mahoney	86c38a31aa	tracing: Fix ftrace_event_call alignment for use with gcc 4.5 GCC 4.5 introduces behavior that forces the alignment of structures to use the largest possible value. The default value is 32 bytes, so if some structures are defined with a 4-byte alignment and others aren't declared with an alignment constraint at all - it will align at 32-bytes. For things like the ftrace events, this results in a non-standard array. When initializing the ftrace subsystem, we traverse the _ftrace_events section and call the initialization callback for each event. When the structures are misaligned, we could be treating another part of the structure (or the zeroed out space between them) as a function pointer. This patch forces the alignment for all the ftrace_event_call structures to 4 bytes. Without this patch, the kernel fails to boot very early when built with gcc 4.5. It's trivial to check the alignment of the members of the array, so it might be worthwhile to add something to the build system to do that automatically. Unfortunately, that only covers this case. I've asked one of the gcc developers about adding a warning when this condition is seen. Cc: stable@kernel.org Signed-off-by: Jeff Mahoney <jeffm@suse.com> LKML-Reference: <4B85770B.6010901@suse.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-25 09:38:11 -05:00
Don Zickus	47195d5763	nmi_watchdog: Clean up various small details Mostly copy/paste whitespace damage with a couple of nitpicks by the checkpatch script. Fix the struct definition as requested by Ingo too. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: peterz@infradead.org Cc: gorcunov@gmail.com Cc: aris@redhat.com LKML-Reference: <1266880143-24943-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> -- arch/x86/kernel/apic/hw_nmi.c \| 14 +++++------ arch/x86/kernel/traps.c \| 6 ++-- include/linux/nmi.h \| 2 - kernel/nmi_watchdog.c \| 51 ++++++++++++++++++++---------------------- 4 files changed, 36 insertions(+), 37 deletions(-)	2010-02-25 12:40:50 +01:00
Ingo Molnar	c50cc75271	sched, cgroups: Fix module export I have exported it in `d11c563` - but cgroups.c did not have module.h included ... Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 12:02:13 +01:00
Paul E. McKenney	1ed509a225	rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information When RCU detects a grace-period stall, it currently just prints out the PID of any tasks doing the stalling. This patch adds RCU_CPU_STALL_VERBOSE, which enables the more-verbose reporting from sched_show_task(). Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-21-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:35:02 +01:00
Paul E. McKenney	6155fec92e	rcu: Fix rcutorture mod_timer argument to delay one jiffy The current "mod_timer(&t, 1)" potentially makes the timer fire immediately, change this to wait one jiffy. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-20-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:35:00 +01:00
Paul E. McKenney	3acd9eb31c	rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection Under TREE_PREEMPT_RCU, print_other_cpu_stall() invokes rcu_print_task_stall() with the root rcu_node structure's ->lock held, and rcu_print_task_stall() acquires that same lock for self-deadlock. Fix this by removing the lock acquisition from rcu_print_task_stall(), and making all callers acquire the lock instead. Tested-by: John Kacur <jkacur@redhat.com> Tested-by: Thomas Gleixner <tglx@linutronix.de> Located-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-19-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:59 +01:00
Paul E. McKenney	1304afb225	rcu: Convert to raw_spinlocks The spinlocks in rcutree need to be real spinlocks in preempt-rt. Convert them to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-18-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:58 +01:00
Paul E. McKenney	20133cfce7	rcu: Stop overflowing signed integers The C standard does not specify the result of an operation that overflows a signed integer, so such operations need to be avoided. This patch changes the type of several fields from "long" to "unsigned long" and adjusts operations as needed. ULONG_CMP_GE() and ULONG_CMP_LT() macros are introduced to do the modular comparisons that are appropriate given that overflow is an expected event. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-17-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:57 +01:00
Paul E. McKenney	8bd93a2c5d	rcu: Accelerate grace period if last non-dynticked CPU Currently, rcu_needs_cpu() simply checks whether the current CPU has an outstanding RCU callback, which means that the last CPU to go into dyntick-idle mode might wait a few ticks for the relevant grace periods to complete. However, if all the other CPUs are in dyntick-idle mode, and if this CPU is in a quiescent state (which it is for RCU-bh and RCU-sched any time that we are considering going into dyntick-idle mode), then the grace period is instantly complete. This patch therefore repeatedly invokes the RCU grace-period machinery in order to force any needed grace periods to complete quickly. It does so a limited number of times in order to prevent starvation by an RCU callback function that might pass itself to call_rcu(). However, if any CPU other than the current one is not in dyntick-idle mode, fall back to simply checking (with fix to bug noted by Lai Jiangshan). Also, take advantage of last grace-period forcing, the opportunity to do so noted by Steve Rostedt. And apply simplified #ifdef condition suggested by Frederic Weisbecker. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-15-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:55 +01:00
Paul E. McKenney	497f0ab39c	sched: Better name for for_each_domain_rd As suggested by Peter Ziljstra, make better choice of name for for_each_domain_rd(), containing "rcu_dereference", given that it is but a wrapper for rcu_dereference_check(). The name rcu_dereference_check_sched_domain() does that and provides a separate per-subsystem name space. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-7-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:47 +01:00
Paul E. McKenney	d11c563dd2	sched: Use lockdep-based checking on rcu_dereference() Update the rcu_dereference() usages to take advantage of the new lockdep-based checking. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com> [ -v2: fix allmodconfig missing symbol export build failure on x86 ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:26 +01:00
Paul E. McKenney	0632eb3d75	rcu: Integrate rcu_dereference_check() message into lockdep Make rcu_dereference_check() print the list of held locks in addition to the stack dump to ease debugging. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-3-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 09:41:01 +01:00
Paul E. McKenney	632ee20013	rcu: Introduce lockdep-based checking to RCU read-side primitives Inspection is proving insufficient to catch all RCU misuses, which is understandable given that rcu_dereference() might be protected by any of four different flavors of RCU (RCU, RCU-bh, RCU-sched, and SRCU), and might also/instead be protected by any of a number of locking primitives. It is therefore time to enlist the aid of lockdep. This set of patches is inspired by earlier work by Peter Zijlstra and Thomas Gleixner, and takes the following approach: o Set up separate lockdep classes for RCU, RCU-bh, and RCU-sched. o Set up separate lockdep classes for each instance of SRCU. o Create primitives that check for being in an RCU read-side critical section. These return exact answers if lockdep is fully enabled, but if unsure, report being in an RCU read-side critical section. (We want to avoid false positives!) The primitives are: For RCU: rcu_read_lock_held(void) For RCU-bh: rcu_read_lock_bh_held(void) For RCU-sched: rcu_read_lock_sched_held(void) For SRCU: srcu_read_lock_held(struct srcu_struct *sp) o Add rcu_dereference_check(), which takes a second argument in which one places a boolean expression based on the above primitives and/or lockdep_is_held(). o A new kernel configuration parameter, CONFIG_PROVE_RCU, enables rcu_dereference_check(). This depends on CONFIG_PROVE_LOCKING, and should be quite helpful during the transition period while CONFIG_PROVE_RCU-unaware patches are in flight. The existing rcu_dereference() primitive does no checking, but upcoming patches will change that. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 09:40:59 +01:00
Ingo Molnar	996de8c6fe	Merge commit 'v2.6.33' into core/rcu Merge reason: Update from -rc4 to -final. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 09:40:26 +01:00
Suresh Siddha	c6a0dd7ec6	ptrace: Fix ptrace_regset() comments and diagnose errors specifically Return -EINVAL for the bad size and for unrecognized NT_* type in ptrace_regset() instead of -EIO. Also update the comments for this ptrace interface with more clarifications. Requested-by: Roland McGrath <roland@redhat.com> Requested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <20100222225240.397523600@sbs-t61.sc.intel.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-23 13:45:26 -08:00
Tetsuo Handa	701188374b	kernel/sys.c: fix missing rcu protection for sys_getpriority() find_task_by_vpid() is not safe without rcu_read_lock(). 2.6.33-rc7 got RCU protection for sys_setpriority() but missed it for sys_getpriority(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-22 19:50:34 -08:00
Rafael J. Wysocki	6cbf82148f	PCI PM: Run-time callbacks for PCI bus type Introduce run-time PM callbacks for the PCI bus type. Make the new callbacks work in analogy with the existing system sleep PM callbacks, so that the drivers already converted to struct dev_pm_ops can use their suspend and resume routines for run-time PM without modifications. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-02-22 16:21:19 -08:00
Yinghai Lu	5eeec0ec93	resource: add release_child_resources Useful for freeing a portion of the resource tree, e.g. when trying to reallocate resources more efficiently. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-02-22 16:17:00 -08:00
Dominik Brodowski	3b7a17fcda	resource/PCI: mark struct resource as const Now that we return the new resource start position, there is no need to update "struct resource" inside the align function. Therefore, mark the struct resource as const. Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-02-22 16:16:57 -08:00
Dominik Brodowski	b26b2d494b	resource/PCI: align functions now return start of resource As suggested by Linus, align functions should return the start of a resource, not void. An update of "res->start" is no longer necessary. Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-02-22 16:16:56 -08:00
Linus Torvalds	bee415ce42	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf probe: Init struct probe_point and set counter correctly hw-breakpoint: Keep track of dr7 local enable bits hw-breakpoints: Accept breakpoints on NULL address perf_events: Fix FORK events	2010-02-22 08:55:32 -08:00
Alexey Dobriyan	b54452b07a	const: struct nla_policy Make remaining netlink policies as const. Fixup coding style where needed. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-18 14:30:18 -08:00
Yinghai Lu	b5eb78f76d	sparseirq: Use radix_tree instead of ptrs array Use radix_tree irq_desc_tree instead of irq_desc_ptrs. -v2: according to Eric and cyrill to use radix_tree_lookup_slot and radix_tree_replace_slot Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-32-git-send-email-yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-17 17:27:20 -08:00
Yinghai Lu	99558f0bbe	sparseirq: Change irq_desc_ptrs to static Add replace_irq_desc() instead of poking at the array directly. -v2: remove unneeded boundary check in replace_irq_desc Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-31-git-send-email-yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-17 17:27:03 -08:00
Yinghai Lu	febcb0c59a	irq: Remove unnecessary bootmem code mem_init is moved early already. Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-29-git-send-email-yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-17 17:23:59 -08:00
Thomas Gleixner	b7e56edba4	Merge branch 'linus' into x86/mm x86/mm is on 32-rc4 and missing the spinlock namespace changes which are needed for further commits into this topic. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-17 18:28:05 +01:00
Heiko Carstens	f850c30c8b	tracing/kprobes: Make Kconfig dependencies generic KPROBES_EVENT actually depends on the regs and stack access API (`b1cf540f`) and not on x86. So introduce a new config option which architectures can select if they have the API implemented and switch x86. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> LKML-Reference: <20100210162517.GB6933@osiris.boeblingen.de.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-02-17 13:13:08 +01:00
Mike Frysinger	e7b8e675d9	tracing: Unify arch_syscall_addr() implementations Most implementations of arch_syscall_addr() are the same, so create a default version in common code and move the one piece that differs (the syscall table) to asm/syscall.h. New arch ports don't have to waste time copying & pasting this simple function. The s390/sparc versions need to be different, so document why. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1264498803-17278-1-git-send-email-vapier@gentoo.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-02-17 13:07:21 +01:00
Thomas Gleixner	83ab0aa0d5	sched: Don't use possibly stale sched_class setscheduler() saves task->sched_class outside of the rq->lock held region for a check after the setscheduler changes have become effective. That might result in checking a stale value. rtmutex_setprio() has the same problem, though it is protected by p->pi_lock against setscheduler(), but for correctness sake (and to avoid bad examples) it needs to be fixed as well. Retrieve task->sched_class inside of the rq->lock held region. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@kernel.org	2010-02-17 11:58:18 +01:00
Don Zickus	96ca4028ac	nmi_watchdog: Properly configure for software events Paul Mackerras brought up a good point that when fallbacking to software events, I may have been lucky in my configuration. Modified the code to explicit provide a new configuration for software events. Suggested-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: gorcunov@gmail.com Cc: aris@redhat.com Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1266357745-26671-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-17 08:16:02 +01:00
Don Zickus	6081b6cd97	nmi_watchdog: support for oprofile Re-arrange the code so that when someone disables nmi_watchdog with: echo 0 > /proc/sys/kernel/nmi_watchdog it releases the hardware reservation on the PMUs. This allows the oprofile module to grab those PMUs and do its thing. Otherwise oprofile fails to load because the hardware is reserved by the perf_events subsystem. Tested using: oprofile --vm-linux --start and watched it failed when nmi_watchdog is enabled and succeed when: oprofile --deinit && echo 0 > /proc/sys/kernel/nmi_watchdog is run. Note: this has the side quirk of having the nmi_watchdog latch onto the software events instead of hardware events if oprofile has already reserved the hardware first. User beware! :-) Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: gorcunov@gmail.com Cc: aris@redhat.com Cc: eranian@google.com LKML-Reference: <1266357892-30504-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-17 08:16:02 +01:00
Yinghai Lu	580e0ad21d	core: Move early_res from arch/x86 to kernel/ This makes the range reservation feature available to other architectures. -v2: add get_max_mapped, max_pfn_mapped only defined in x86... to fix PPC compiling -v3: according to hpa, add CONFIG_HAVE_EARLY_RES -v4: fix typo about EARLY_RES in config Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4B7B5723.4070009@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-16 21:43:39 -08:00
Tejun Heo	43cf38eb5c	percpu: add __percpu sparse annotations to core kernel subsystems Add __percpu sparse annotations to core subsystems. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-mm@kvack.org Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Eric Biederman <ebiederm@xmission.com>	2010-02-17 11:17:38 +09:00
Anton Vorontsov	5a5e0f4c70	kfifo: Don't use integer as NULL pointer This patch fixes following sparse warnings: include/linux/kfifo.h:127:25: warning: Using plain integer as NULL pointer kernel/kfifo.c:83:21: warning: Using plain integer as NULL pointer Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Acked-by: Stefani Seibold <stefani@seibold.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-02-16 15:11:08 -08:00
Anton Vorontsov	1a02d59aba	kfifo: Make kfifo_initialized work after kfifo_free After kfifo rework it's no longer possible to reliably know if kfifo is usable, since after kfifo_free(), kfifo_initialized() would still return true. The correct behaviour is needed for at least FHCI USB driver. This patch fixes the issue by resetting the kfifo to zero values (the same approach is used in kfifo_alloc() if allocation failed). Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Acked-by: Stefani Seibold <stefani@seibold.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-02-16 15:11:06 -08:00
Thomas Gleixner	6e40f5bbbc	Merge branch 'sched/urgent' into sched/core Conflicts: kernel/sched.c Necessary due to the urgent fixes which conflict with the code move from sched.c to sched_fair.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-16 16:48:56 +01:00
Peter Zijlstra	0970d2992d	sched: Fix race between ttwu() and task_rq_lock() Thomas found that due to ttwu() changing a task's cpu without holding the rq->lock, task_rq_lock() might end up locking the wrong rq. Avoid this by serializing against TASK_WAKING. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1266241712.15770.420.camel@laptop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-16 15:13:59 +01:00
Suresh Siddha	9000f05c6d	sched: Fix SMT scheduler regression in find_busiest_queue() Fix a SMT scheduler performance regression that is leading to a scenario where SMT threads in one core are completely idle while both the SMT threads in another core (on the same socket) are busy. This is caused by this commit (with the problematic code highlighted) commit `bdb94aa5db` Author: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Tue Sep 1 10:34:38 2009 +0200 sched: Try to deal with low capacity @@ -4203,15 +4223,18 @@ find_busiest_queue() ... for_each_cpu(i, sched_group_cpus(group)) { + unsigned long power = power_of(i); ... - wl = weighted_cpuload(i); + wl = weighted_cpuload(i) * SCHED_LOAD_SCALE; + wl /= power; - if (rq->nr_running == 1 && wl > imbalance) + if (capacity && rq->nr_running == 1 && wl > imbalance) continue; On a SMT system, power of the HT logical cpu will be 589 and the scheduler load imbalance (for scenarios like the one mentioned above) can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling the weighted load with the power will result in "wl > imbalance" and ultimately resulting in find_busiest_queue() return NULL, causing load_balance() to think that the load is well balanced. But infact one of the tasks can be moved to the idle core for optimal performance. We don't need to use the weighted load (wl) scaled by the cpu power to compare with imabalance. In that condition, we already know there is only a single task "rq->nr_running == 1" and the comparison between imbalance, wl is to make sure that we select the correct priority thread which matches imbalance. So we really need to compare the imabalnce with the original weighted load of the cpu and not the scaled load. But in other conditions where we want the most hammered(busiest) cpu, we can use scaled load to ensure that we consider the cpu power in addition to the actual load on that cpu, so that we can move the load away from the guy that is getting most hammered with respect to the actual capacity, as compared with the rest of the cpu's in that busiest group. Fix it. Reported-by: Ma Ling <ling.ma@intel.com> Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1266023662.2808.118.camel@sbs-t61.sc.intel.com> Cc: stable@kernel.org [2.6.32.x] Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-16 15:13:59 +01:00
Linus Torvalds	7d0bab9dfe	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer, softirq: Fix hrtimer->softirq trampoline	2010-02-15 19:52:12 -08:00
Linus Torvalds	627a9a194d	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing/kprobes: Fix probe parsing tracing: Fix circular dead lock in stack trace	2010-02-15 19:47:59 -08:00
Linus Torvalds	3d8b4bdef7	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf top: Fix help text alignment perf: Fix hypervisor sample reporting perf: Make bp_len type to u64 generic across the arch	2010-02-15 19:47:48 -08:00
Uwe Kleine-König	dfff0615d2	tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-15 15:38:10 +01:00
Masami Hiramatsu	8b833c506c	kprobes: Add mcount to the kprobes blacklist Since mcount function can be called from everywhere, it should be blacklisted. Moreover, the "mcount" symbol is a special symbol name. So, it is better to put it in the generic blacklist. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100205062433.3745.36726.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-15 05:45:49 +01:00
Peter Zijlstra	6f93d0a7c8	perf_events: Fix FORK events Commit `22e19085` ("Honour event state for aux stream data") introduced a bug where we would drop FORK events. The thing is that we deliver FORK events to the child process' event, which at that time will be PERF_EVENT_STATE_INACTIVE because the child won't be scheduled in (we're in the middle of fork). Solve this twice, change the event state filter to exclude only disabled (STATE_OFF) or worse, and deliver FORK events to the current (parent). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Anton Blanchard <anton@samba.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> LKML-Reference: <1266142324.5273.411.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-14 18:10:39 +01:00
Heiko Carstens	a9bb18f36c	tracing/kprobes: Fix probe parsing Trying to add a probe like: echo p:myprobe 0x10000 > /sys/kernel/debug/tracing/kprobe_events will fail since the wrong pointer is passed to strict_strtoul when trying to convert the address to an unsigned long. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100210162346.GA6933@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-14 09:43:58 +01:00
Don Zickus	cf454aecb3	nmi_watchdog: Fallback to software events when no hardware pmu detected Not all arches have a PMU or have perf_event support for their PMU. The nmi_watchdog will fail in those cases. Fallback to using software events to generate nmi_watchdog traffic with local apic interrupts. Tested on a Pentium4 and it worked as expected, excepting for detecting cpu lockups. The problem with using software events as a cpu lock up detector is the nmi_watchdog uses the logic that if local apic interrupts stop incrementing then the cpu is probably locked up. But with software events we use the local apic to trigger the nmi_watchdog callback to see if local apic interrupts are still firing, which obviously they are otherwise we wouldn't have been triggered. The algorithm to detect cpu lock ups is the same as the old nmi_watchdog. Perhaps we need to find a better way to detect lock ups? Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: peterz@infradead.org Cc: gorcunov@gmail.com Cc: aris@redhat.com LKML-Reference: <1266013161-31197-3-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-14 09:19:44 +01:00
Don Zickus	504d7cf10e	nmi_watchdog: Compile and portability fixes The original patch was x86_64 centric. Changed the code to make it less so. ested by building and running on a powerpc. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: peterz@infradead.org Cc: gorcunov@gmail.com Cc: aris@redhat.com LKML-Reference: <1266013161-31197-2-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-14 09:19:43 +01:00
Suresh Siddha	2225a122ae	ptrace: Add support for generic PTRACE_GETREGSET/PTRACE_SETREGSET Generic support for PTRACE_GETREGSET/PTRACE_SETREGSET commands which export the regsets supported by each architecture using the correponding NT_* types. These NT_* types are already part of the userland ABI, used in representing the architecture specific register sets as different NOTES in an ELF core file. 'addr' parameter for the ptrace system call encode the REGSET type (using the corresppnding NT_* type) and the 'data' parameter points to the struct iovec having the user buffer and the length of that buffer. struct iovec iov = { buf, len}; ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, &iov); On successful completion, iov.len will be updated by the kernel specifying how much the kernel has written/read to/from the user's iov.buf. x86 extended state registers are primarily exported using this interface. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <20100211195614.886724710@sbs-t61.sc.intel.com> Acked-by: Hongjiu Lu <hjl.tools@gmail.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-11 15:08:33 -08:00
Li Zefan	c7c6b1fe9f	ftrace: Allow to remove a single function from function graph filter I don't see why we can only clear all functions from the filter. After patching: # echo sys_open > set_graph_function # echo sys_close >> set_graph_function # cat set_graph_function sys_open sys_close # echo '!sys_close' >> set_graph_function # cat set_graph_function sys_open Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B726388.2000408@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-11 14:32:38 -05:00
Jean Delvare	5c42dc7070	devres/irq: Fix devm_irq_match comment Fix the reference (in comment). Signed-off-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-11 16:01:02 +01:00
Yinghai Lu	e9a0064ad0	x86: Change range end to start+size So make interface more consistent with early_res. Later we can share some code with early_res. Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-10-git-send-email-yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-10 17:47:17 -08:00
Yinghai Lu	27811d8cab	x86: Move range related operation to one file We have almost the same code for mtrr cleanup and amd_bus checkup, and this code will also be used in replacing bootmem with early_res, so try to move them together and reuse it from different parts. Also rename update_range to subtract_range as that is what the function is actually doing. -v2: update comments as Christoph requested Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-4-git-send-email-yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-10 17:47:17 -08:00
H. Peter Anvin	84abd88a70	Merge remote branch 'linus/master' into x86/bootmem	2010-02-10 16:55:28 -08:00
Brandon Phiilps	ced5b697a7	x86: Avoid race condition in pci_enable_msix() Keep chip_data in create_irq_nr and destroy_irq. When two drivers are setting up MSI-X at the same time via pci_enable_msix() there is a race. See this dmesg excerpt: [ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X [ 85.170611] alloc irq_desc for 99 on node -1 [ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X [ 85.170614] alloc kstat_irqs on node -1 [ 85.170616] alloc irq_2_iommu on node -1 [ 85.170617] alloc irq_desc for 100 on node -1 [ 85.170619] alloc kstat_irqs on node -1 [ 85.170621] alloc irq_2_iommu on node -1 [ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X [ 85.170626] alloc irq_desc for 101 on node -1 [ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X [ 85.170630] alloc kstat_irqs on node -1 [ 85.170631] alloc irq_2_iommu on node -1 [ 85.170635] alloc irq_desc for 102 on node -1 [ 85.170636] alloc kstat_irqs on node -1 [ 85.170639] alloc irq_2_iommu on node -1 [ 85.170646] BUG: unable to handle kernel NULL pointer dereference at 0000000000000088 As you can see igb and ixgbe are both alternating on create_irq_nr() via pci_enable_msix() in their probe function. ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data = NULL via dynamic_irq_init(). igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[] via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this: cfg_new = irq_desc_ptrs[102]->chip_data; if (cfg_new->vector != 0) continue; This hits the NULL deref. Another possible race exists via pci_disable_msix() in a driver or in the number of error paths that call free_msi_irqs(): destroy_irq() dynamic_irq_cleanup() which sets desc->chip_data = NULL ...race window... desc->chip_data = cfg; Remove the save and restore code for cfg in create_irq_nr() and destroy_irq() and take the desc->lock when checking the irq_cfg. Reported-and-analyzed-by: Brandon Philips <bphilips@suse.de> Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-3-git-send-email-yinghai@kernel.org> Signed-off-by: Brandon Phililps <bphilips@suse.de> Cc: stable@kernel.org Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-10 14:27:28 -08:00
Steven Rostedt	ede55c9d78	tracing: Add correct/incorrect to sort keys for branch annotation output The branch annotation is a bit difficult to see the worst offenders because it only sorts by percentage: correct incorrect % Function File Line ------- --------- - -------- ---- ---- 0 163 100 qdisc_restart sch_generic.c 179 0 163 100 pfifo_fast_dequeue sch_generic.c 447 0 4 100 pskb_trim_rcsum skbuff.h 1689 0 4 100 llc_rcv llc_input.c 170 0 18 100 psmouse_interrupt psmouse-base.c 304 0 3 100 atkbd_interrupt atkbd.c 389 0 5 100 usb_alloc_dev usb.c 437 0 11 100 vsscanf vsprintf.c 1897 0 2 100 IS_ERR err.h 34 0 23 100 __rmqueue_fallback page_alloc.c 865 0 4 100 probe_wakeup_sched_switch trace_sched_wakeup.c 142 0 3 100 move_masked_irq migration.c 11 Adding the incorrect and correct values as sort keys makes this file a bit more informative: correct incorrect % Function File Line ------- --------- - -------- ---- ---- 0 366541 100 audit_syscall_entry auditsc.c 1637 0 366538 100 audit_syscall_exit auditsc.c 1685 0 115839 100 sched_info_switch sched_stats.h 269 0 74567 100 sched_info_queued sched_stats.h 222 0 66578 100 sched_info_dequeued sched_stats.h 177 0 15113 100 trace_workqueue_insertion workqueue.h 38 0 15107 100 trace_workqueue_execution workqueue.h 45 0 3622 100 syscall_trace_leave ptrace.c 1772 0 2750 100 sched_move_task sched.c 10100 0 2750 100 sched_move_task sched.c 10110 0 1815 100 pre_schedule_rt sched_rt.c 1462 0 837 100 audit_alloc auditsc.c 879 0 814 100 tcp_mss_split_point tcp_output.c 1302 Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-09 21:35:05 -05:00
Jason Wang	c93d89f3db	Export the symbol of getboottime and mmonotonic_to_bootbased Export getboottime and monotonic_to_bootbased in order to let them could be used by following patch. Cc: stable@kernel.org Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-02-09 19:20:15 +02:00
Anton Blanchard	301ba0457f	kthread, sched: Remove reference to kthread_create_on_cpu kthread_create_on_cpu doesn't exist so update a comment in kthread.c to reflect this. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100209040740.GB3702@kryten> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-09 11:47:39 +01:00
Anton Blanchard	b80109e256	Remove reference to kthread_create_on_cpu kthread_create_on_cpu doesn't exist so update a comment in kthread.c to reflect this. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-09 11:28:48 +01:00
Al Viro	cccc6bba3f	Lose the first argument of audit_inode_child() it's always equal to ->d_name.name of the second argument Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-08 14:38:36 -05:00
Anton Blanchard	fa535a77bd	sched: cpuacct: Use bigger percpu counter batch values for stats counters When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can call cpuacct_update_stats with values much larger than percpu_counter_batch. This means the call to percpu_counter_add will always add to the global count which is protected by a spinlock and we end up with a global spinlock in the scheduler. Based on an idea by KOSAKI Motohiro, this patch scales the batch value by cputime_one_jiffy such that we have the same batch limit as we would if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot but that initialisation happened too early on PowerPC (before time_init) and it was never updated at runtime as a result of a hotplug cpu add/remove. This patch instead scales percpu_counter_batch by cputime_one_jiffy at runtime, which keeps the batch correct even after cpu hotplug operations. We cap it at INT_MAX in case of overflow. For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING, cputime_one_jiffy is the constant 1 and gcc is smart enough to optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch at least on x86 and PowerPC. So there is no need to add an #ifdef. On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel: CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec Tested with: wget http://ozlabs.org/~anton/junkcode/context_switch.c make context_switch for i in `seq 0 63`; do taskset -c $i ./context_switch & done vmstat 1 Signed-off-by: Anton Blanchard <anton@samba.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-08 08:57:37 +01:00
Ingo Molnar	6d3e0907b8	Merge branch 'sched/urgent' into sched/core Merge reason: Merge dependent fix, update to latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-08 08:55:46 +01:00
Andrew Morton	50200df462	kernel/sched.c: Suppress unused var warning On UP: kernel/sched.c: In function 'wake_up_new_task': kernel/sched.c:2631: warning: unused variable 'cpu' Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-08 08:53:19 +01:00
Don Zickus	84e478c6f1	nmi_watchdog: Config option to enable new nmi_watchdog These are the bits that enable the new nmi_watchdog and safely isolate the old nmi_watchdog. Only one or the other can run, not both at the same time. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: gorcunov@gmail.com Cc: aris@redhat.com Cc: peterz@infradead.org LKML-Reference: <1265424425-31562-4-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-08 08:29:03 +01:00
Don Zickus	1fb9d6ad27	nmi_watchdog: Add new, generic implementation, using perf events This is a new generic nmi_watchdog implementation using the perf events infrastructure as suggested by Ingo. The implementation is simple, just create an in-kernel perf event and register an overflow handler to check for cpu lockups. I created a generic implementation that lives in kernel/ and the hardware specific part that for now lives in arch/x86. This approach has a number of advantages: - It simplifies the x86 PMU implementation in the long run, in that it removes the hardcoded low-level PMU implementation that was the NMI watchdog before. - It allows new NMI watchdog features to be added in a central place. - It allows other architectures to enable the NMI watchdog, as long as they have perf events (that provide NMIs) implemented. - It also allows for more graceful co-existence of existing perf events apps and the NMI watchdog - before these changes the relationship was exclusive. (The NMI watchdog will 'spend' a perf event when enabled. In later iterations we might be able to piggyback from an existing NMI event without having to allocate a hardware event for the NMI watchdog - turning this into a no-hardware-cost feature.) As for compatibility, we'll keep the old NMI watchdog code as well until the new one can 100% replace it on all CPUs, old and new alike. That might take some time as the NMI watchdog has been ported to many CPU models. I have done light testing to make sure the framework works correctly and it does. v2: Set the correct timeout values based on the old nmi watchdog Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: gorcunov@gmail.com Cc: aris@redhat.com Cc: peterz@infradead.org LKML-Reference: <1265424425-31562-3-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-08 08:29:02 +01:00
H Hartley Sweeten	6622e670b2	posix-timers.c: Don't export local functions Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-05 14:54:10 +01:00
Magnus Damm	c54a42b19f	clocksource: add suspend callback Add a clocksource suspend callback. This callback can be used by the clocksource driver to shutdown and perform any kind of late suspend activities even though the clocksource driver itself is a non-sysdev driver. One example where this is useful is to fix the sh_cmt.c platform driver that today suspends using the platform bus and shuts down the clocksource too early. With this callback in place the sh_cmt driver will suspend using the clocksource and clockevent hooks and leave the platform device pm callbacks unused. Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-05 14:54:10 +01:00
Magnus Damm	17622339af	clocksource: add argument to resume callback Pass the clocksource as an argument to the clocksource resume callback. Needed so we can point out which CMT channel the sh_cmt.c driver shall resume. Signed-off-by: Magnus Damm <damm@opensource.se> Cc: john stultz <johnstul@us.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-05 14:54:10 +01:00
Daniel Mack	1537a3638c	tree-wide: fix 'lenght' typo in comments and code Some misspelled occurences of 'octet' and some comments were also fixed as I was on it. Signed-off-by: Daniel Mack <daniel@caiaq.de> Cc: Jiri Kosina <trivial@kernel.org> Cc: Joe Perches <joe@perches.com> Cc: Junio C Hamano <gitster@pobox.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-05 12:22:45 +01:00
Baruch Siach	9ce8e498ee	devres: typo fix s/dev/devm/ Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-05 12:22:43 +01:00
Edward Z. Yang	350f82586b	Remove redundant trailing semicolons from macros Signed-off-by: Edward Z. Yang <ezyang@ksplice.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-05 12:22:43 +01:00
Thadeu Lima de Souza Cascardo	af66585270	fix comment typo boo -> boot in ksysfs.c Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-05 12:22:37 +01:00
Adam Buchbinder	c9404c9c39	Fix misspelling of "should" and "shouldn't" in comments. Some comments misspell "should" or "shouldn't"; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-05 12:22:30 +01:00
Masami Hiramatsu	5ecaafdbf4	kprobes: Add mcount to the kprobes blacklist Since mcount function can be called from everywhere, it should be blacklisted. Moreover, the "mcount" symbol is a special symbol name. So, it is better to put it in the generic blacklist. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100205062433.3745.36726.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-05 08:13:57 +01:00
Linus Torvalds	aa16cd8d12	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Handle futex value corruption gracefully futex: Handle user space corruption gracefully futex_lock_pi() key refcnt fix softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume	2010-02-04 16:07:41 -08:00
Adam Buchbinder	2a61aa4016	Fix misspellings of "invocation" in comments. Some comments misspell "invocation"; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-04 11:55:45 +01:00
Adam Buchbinder	c41b20e721	Fix misspellings of "truly" in comments. Some comments misspell "truly"; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-04 11:55:45 +01:00
Peter Zijlstra	9717e6cd3d	perf_events: Optimize perf_event_task_tick() Pretty much all of the calls do perf_disable/perf_enable cycles, pull that out to cut back on hardware programming. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:59:49 +01:00
Yong Zhang	2357725695	sched: Remove member rt_se from struct rt_rq It's a duplicate of tg->rt_se[cpu] and the only usage is sched_rt_rq_dequeue() and sched_rt_rq_enqueue(). After the first patch to those two function. rt_se can be removed. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <2674af741001282258q38781619u653ca4a7dd267347@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:57:33 +01:00
Yong Zhang	74b7eb5885	sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu] This is the first step to remove rt_rq member rt_se because it have the same meaning with tg->rt_se[cpu]. And the latter style is also used by the fair scheduling class. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <2674af741001282257r28c97a92o9f90cf16fe8d3d84@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:57:32 +01:00
Masami Hiramatsu	f24bb999d2	ftrace: Remove record freezing Remove record freezing. Because kprobes never puts probe on ftrace's mcount call anymore, it doesn't need ftrace to check whether kprobes on it. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: przemyslaw@pawelczyk.it Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100202214925.4694.73469.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:36:19 +01:00
Masami Hiramatsu	4554dbcb85	kprobes: Check probe address is reserved Check whether the address of new probe is already reserved by ftrace or alternatives (on x86) when registering new probe. If reserved, it returns an error and not register the probe. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: przemyslaw@pawelczyk.it Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <20100202214918.4694.94179.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:36:19 +01:00
Masami Hiramatsu	2cfa19780d	ftrace/alternatives: Introducing _text_reserved functions Introducing _text_reserved functions for checking the text address range is partially reserved or not. This patch provides checking routines for x86 smp alternatives and dynamic ftrace. Since both functions modify fixed pieces of kernel text, they should reserve and protect those from other dynamic text modifier, like kprobes. This will also be extended when introducing other subsystems which modify fixed pieces of kernel text. Dynamic text modifiers should avoid those. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: przemyslaw@pawelczyk.it Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <20100202214911.4694.16587.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:36:19 +01:00
Masami Hiramatsu	615d0ebbc7	kprobes: Disable booster when CONFIG_PREEMPT=y Disable kprobe booster when CONFIG_PREEMPT=y at this time, because it can't ensure that all kernel threads preempted on kprobe's boosted slot run out from the slot even using freeze_processes(). The booster on preemptive kernel will be resumed if synchronize_tasks() or something like that is introduced. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100202214904.4694.24330.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:36:18 +01:00
Kees Cook	d78ca3cd73	syslog: use defined constants instead of raw numbers Right now the syslog "type" action are just raw numbers which makes the source difficult to follow. This patch replaces the raw numbers with defined constants for some level of sanity. Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: John Johansen <john.johansen@canonical.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-02-04 14:20:41 +11:00
Kees Cook	002345925e	syslog: distinguish between /proc/kmsg and syscalls This allows the LSM to distinguish between syslog functions originating from /proc/kmsg access and direct syscalls. By default, the commoncaps will now no longer require CAP_SYS_ADMIN to read an opened /proc/kmsg file descriptor. For example the kernel syslog reader can now drop privileges after opening /proc/kmsg, instead of staying privileged with CAP_SYS_ADMIN. MAC systems that implement security_syslog have unchanged behavior. Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: John Johansen <john.johansen@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-02-04 14:20:12 +11:00
Mahesh Salgaonkar	cd757645fb	perf: Make bp_len type to u64 generic across the arch Change 'bp_len' type to __u64 to make it work across archs as the s390 architecture watch point length can be upto 2^64. reference: http://lkml.org/lkml/2010/1/25/212 This is an ABI change that is not backward compatible with the previous hardware breakpoint info layout integrated in this development cycle, a rebuilt of perf tools is necessary for versions based on 2.6.33-rc1 - 2.6.33-rc6 to work with a kernel based on this patch. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "K. Prasad" <prasad@linux.vnet.ibm.com> Cc: Maneesh Soni <maneesh@in.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin <schwidefsky@de.ibm.com> LKML-Reference: <20100130045518.GA20776@in.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-02-04 01:07:12 +01:00
Peter Zijlstra	b9c3032277	hrtimer, softirq: Fix hrtimer->softirq trampoline hrtimers callbacks are always done from hardirq context, either the jiffy tick interrupt or the hrtimer device interrupt. [ there is currently one exception that can still call a hrtimer callback from softirq, but even in that case this will still work correctly. ] Reported-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Yury Polyanskiy <ypolyans@princeton.edu> Tested-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> LKML-Reference: <1265120401.24455.306.camel@laptop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-03 18:17:40 +01:00
Thomas Gleixner	59647b6ac3	futex: Handle futex value corruption gracefully The WARN_ON in lookup_pi_state which complains about a mismatch between pi_state->owner->pid and the pid which we retrieved from the user space futex is completely bogus. The code just emits the warning and then continues despite the fact that it detected an inconsistent state of the futex. A conveniant way for user space to spam the syslog. Replace the WARN_ON by a consistency check. If the values do not match return -EINVAL and let user space deal with the mess it created. This also fixes the missing task_pid_vnr() when we compare the pi_state->owner pid with the futex value. Reported-by: Jermome Marchand <jmarchan@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org>	2010-02-03 15:13:22 +01:00
Thomas Gleixner	51246bfd18	futex: Handle user space corruption gracefully If the owner of a PI futex dies we fix up the pi_state and set pi_state->owner to NULL. When a malicious or just sloppy programmed user space application sets the futex value to 0 e.g. by calling pthread_mutex_init(), then the futex can be acquired again. A new waiter manages to enqueue itself on the pi_state w/o damage, but on unlock the kernel dereferences pi_state->owner and oopses. Prevent this by checking pi_state->owner in the unlock path. If pi_state->owner is not current we know that user space manipulated the futex value. Ignore the mess and return -EINVAL. This catches the above case and also the case where a task hijacks the futex by setting the tid value and then tries to unlock it. Reported-by: Jermome Marchand <jmarchan@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org>	2010-02-03 15:13:22 +01:00
Mikael Pettersson	5ecb01cfdf	futex_lock_pi() key refcnt fix This fixes a futex key reference count bug in futex_lock_pi(), where a key's reference count is incremented twice but decremented only once, causing the backing object to not be released. If the futex is created in a temporary file in an ext3 file system, this bug causes the file's inode to become an "undead" orphan, which causes an oops from a BUG_ON() in ext3_put_super() when the file system is unmounted. glibc's test suite is known to trigger this, see <http://bugzilla.kernel.org/show_bug.cgi?id=14256>. The bug is a regression from 2.6.28-git3, namely Peter Zijlstra's `38d47c1b70` "[PATCH] futex: rely on get_user_pages() for shared futexes". That commit made get_futex_key() also increment the reference count of the futex key, and updated its callers to decrement the key's reference count before returning. Unfortunately the normal exit path in futex_lock_pi() wasn't corrected: the reference count is incremented by get_futex_key() and queue_lock(), but the normal exit path only decrements once, via unqueue_me_pi(). The fix is to put_futex_key() after unqueue_me_pi(), since 2.6.31 this is easily done by 'goto out_put_key' rather than 'goto out'. Signed-off-by: Mikael Pettersson <mikpe@it.uu.se> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Darren Hart <dvhltc@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org>	2010-02-03 15:13:22 +01:00
Linus Torvalds	c80d292f13	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: kernel/cred.c: use kmem_cache_free	2010-02-02 18:12:22 -08:00
Li Zefan	4528fd0595	cgroups: fix to return errno in a failure path In cgroup_create(), if alloc_css_id() returns failure, the errno is not propagated to userspace, so mkdir will fail silently. To trigger this bug, we mount blkio (or memory subsystem), and create more then 65534 cgroups. (The number of cgroups is limited to 65535 if a subsystem has use_id == 1) # mount -t cgroup -o blkio xxx /mnt # for ((i = 0; i < 65534; i++)); do mkdir /mnt/$i; done # mkdir /mnt/65534 (should return ENOSPC) # Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Paul Menage <menage@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-02 18:11:22 -08:00
Randy Dunlap	bc173f7092	kfifo: fix kernel-doc notation Fix kfifo kernel-doc warnings: Warning(kernel/kfifo.c:361): No description found for parameter 'total' Warning(kernel/kfifo.c:402): bad line: @ @lenout: pointer to output variable with copied data Warning(kernel/kfifo.c:412): No description found for parameter 'lenout' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Stefani Seibold <stefani@seibold.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-02 18:11:21 -08:00
Julia Lawall	b8a1d37c5f	kernel/cred.c: use kmem_cache_free Free memory allocated using kmem_cache_zalloc using kmem_cache_free rather than kfree. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x,E,c; @@ x = $kmem_cache_alloc\\|kmem_cache_zalloc\\|kmem_cache_alloc_node$(c,...) ... when != x = E when != &x ?-kfree(x) +kmem_cache_free(c,x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Steve Dickson <steved@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: James Morris <jmorris@namei.org>	2010-02-03 10:21:57 +11:00
Lai Jiangshan	4f48f8b7fd	tracing: Fix circular dead lock in stack trace When we cat <debugfs>/tracing/stack_trace, we may cause circular lock: sys_read() t_start() arch_spin_lock(&max_stack_lock); t_show() seq_printf(), vsnprintf() .... /* they are all trace-able, when they are traced, max_stack_lock may be required again. */ The following script can trigger this circular dead lock very easy: #!/bin/bash echo 1 > /proc/sys/kernel/stack_tracer_enabled mount -t debugfs xxx /mnt > /dev/null 2>&1 ( # make check_stack() zealous to require max_stack_lock for ((; ;)) { echo 1 > /mnt/tracing/stack_max_size } ) & for ((; ;)) { cat /mnt/tracing/stack_trace > /dev/null } To fix this bug, we increase the percpu trace_active before require the lock. Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B67D4F9.9080905@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-02 10:20:18 -05:00
Peter Zijlstra	4a461c85b6	sched: Remove unused update_shares_locked() Commit `f492e12ef0` ("sched: Remove load_balance_newidle()") removed the only user of this function, so remove it too. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1265019219.24455.128.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-02 06:58:27 +01:00
Akinobu Mita	90fdbdb484	sched: Use for_each_bit No change in functionality. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <1264938810-4173-1-git-send-email-akinobu.mita@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-02 06:58:27 +01:00
Tejun Heo	ab386128f2	Merge branch 'master' into percpu	2010-02-02 14:38:15 +09:00
Andrew Morton	e527300715	Generic page_is_ram: use __weak Use __weak instead of __attribute__((weak)). Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-01 16:58:17 -08:00
Wu Fengguang	61ef2489db	resources: introduce generic page_is_ram() It's based on walk_system_ram_range(), for archs that don't have their own page_is_ram(). The static verions in MIPS and SCORE are also made global. v4: prefer plain 1 instead of PAGE_IS_RAM (H. Peter Anvin) v3: add comment (KAMEZAWA Hiroyuki) "AFAIK, this "System RAM" information has been used for kdump to grab valid memory area and seems good for the kernel itself." v2: add PAGE_IS_RAM macro (Américo Wang) Cc: Chen Liqin <liqin.chen@sunplusct.com> Cc: Lennox Wu <lennox.wu@gmail.com> Cc: Américo Wang <xiyou.wangcong@gmail.com> Cc: linux-mips@linux-mips.org Cc: Yinghai Lu <yinghai@kernel.org> Acked-by: Ralf Baechle <ralf@linux-mips.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> LKML-Reference: <20100122081619.GA6431@localhost> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-01 16:58:17 -08:00
Linus Torvalds	e20da89130	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: Fix check_usage_backwards() error message	2010-02-01 10:45:26 -08:00
Linus Torvalds	834db333ed	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf, hw_breakpoint, kgdb: Do not take mutex for kernel debugger x86, hw_breakpoints, kgdb: Fix kgdb to use hw_breakpoint API hw_breakpoints: Release the bp slot if arch_validate_hwbkpt_settings() fails. perf: Ignore perf.data.old perf report: Fix segmentation fault when running with '-g none'	2010-02-01 10:45:00 -08:00
Linus Torvalds	8ea85c2817	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Correct printk whitespace in warning from cpu down task check sched: Fix incorrect sanity check sched: Fix fork vs hotplug vs cpuset namespaces	2010-02-01 10:44:36 -08:00
Linus Torvalds	bdd8466783	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: Prevent potential kgdb dead lock	2010-02-01 10:44:06 -08:00
Jason Wessel	d6ad3e286d	softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, sched_clock() gets the time from hardware such as the TSC on x86. In this configuration kgdb will report a softlock warning message on resuming or detaching from a debug session. Sequence of events in the problem case: 1) "cpu sched clock" and "hardware time" are at 100 sec prior to a call to kgdb_handle_exception() 2) Debugger waits in kgdb_handle_exception() for 80 sec and on exit the following is called ... touch_softlockup_watchdog() --> __raw_get_cpu_var(touch_timestamp) = 0; 3) "cpu sched clock" = 100s (it was not updated, because the interrupt was disabled in kgdb) but the "hardware time" = 180 sec 4) The first timer interrupt after resuming from kgdb_handle_exception updates the watchdog from the "cpu sched clock" update_process_times() { ... run_local_timers() --> softlockup_tick() --> check (touch_timestamp == 0) (it is "YES" here, we have set "touch_timestamp = 0" at kgdb) --> __touch_softlockup_watchdog() *(A)--> reset "touch_timestamp" to "get_timestamp()" (Here, the "touch_timestamp" will still be set to 100s.) ... scheduler_tick() (B)--> sched_clock_tick() (update "cpu sched clock" to "hardware time" = 180s) ... } 5) The Second timer interrupt handler appears to have a large jump and trips the softlockup warning. update_process_times() { ... run_local_timers() --> softlockup_tick() --> "cpu sched clock" - "touch_timestamp" = 180s-100s > 60s --> printk "soft lockup error messages" ... } note: **(A) reset "touch_timestamp" to "get_timestamp(this_cpu)" Why is "touch_timestamp" 100 sec, instead of 180 sec? When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, the call trace of get_timestamp() is: get_timestamp(this_cpu) -->cpu_clock(this_cpu) -->sched_clock_cpu(this_cpu) -->__update_sched_clock(sched_clock_data, now) The __update_sched_clock() function uses the GTOD tick value to create a window to normalize the "now" values. So if "now" value is too big for sched_clock_data, it will be ignored. The fix is to invoke sched_clock_tick() to update "cpu sched clock" in order to recover from this state. This is done by introducing the function touch_softlockup_watchdog_sync(). This allows kgdb to request that the sched clock is updated when the watchdog thread runs the first time after a resume from kgdb. [yong.zhang0@gmail.com: Use per cpu instead of an array] Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> Cc: kgdb-bugreport@lists.sourceforge.net Cc: peterz@infradead.org LKML-Reference: <1264631124-4837-2-git-send-email-jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-01 08:22:32 +01:00
Jason Wessel	5352ae638e	perf, hw_breakpoint, kgdb: Do not take mutex for kernel debugger This patch fixes the regression in functionality where the kernel debugger and the perf API do not nicely share hw breakpoint reservations. The kernel debugger cannot use any mutex_lock() calls because it can start the kernel running from an invalid context. A mutex free version of the reservation API needed to get created for the kernel debugger to safely update hw breakpoint reservations. The possibility for a breakpoint reservation to be concurrently processed at the time that kgdb interrupts the system is improbable. Should this corner case occur the end user is warned, and the kernel debugger will prohibit updating the hardware breakpoint reservations. Any time the kernel debugger reserves a hardware breakpoint it will be a system wide reservation. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: kgdb-bugreport@lists.sourceforge.net Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: torvalds@linux-foundation.org LKML-Reference: <1264719883-7285-3-git-send-email-jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-30 08:42:21 +01:00
Jason Wessel	cc0967490c	x86, hw_breakpoints, kgdb: Fix kgdb to use hw_breakpoint API In the 2.6.33 kernel, the hw_breakpoint API is now used for the performance event counters. The hw_breakpoint_handler() now consumes the hw breakpoints that were previously set by kgdb arch specific code. In order for kgdb to work in conjunction with this core API change, kgdb must use some of the low level functions of the hw_breakpoint API to install, uninstall, and deal with hw breakpoint reservations. The kgdb core required a change to call kgdb_disable_hw_debug anytime a slave cpu enters kgdb_wait() in order to keep all the hw breakpoints in sync as well as to prevent hitting a hw breakpoint while kgdb is active. During the architecture specific initialization of kgdb, it will pre-allocate 4 disabled (struct perf event **) structures. Kgdb will use these to manage the capabilities for the 4 hw breakpoint registers, per cpu. Right now the hw_breakpoint API does not have a way to ask how many breakpoints are available, on each CPU so it is possible that the install of a breakpoint might fail when kgdb restores the system to the run state. The intent of this patch is to first get the basic functionality of hw breakpoints working and leave it to the person debugging the kernel to understand what hw breakpoints are in use and what restrictions have been imposed as a result. Breakpoint constraints will be dealt with in a future patch. While atomic, the x86 specific kgdb code will call arch_uninstall_hw_breakpoint() and arch_install_hw_breakpoint() to manage the cpu specific hw breakpoints. The net result of these changes allow kgdb to use the same pool of hw_breakpoints that are used by the perf event API, but neither knows about future reservations for the available hw breakpoint slots. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: kgdb-bugreport@lists.sourceforge.net Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: torvalds@linux-foundation.org LKML-Reference: <1264719883-7285-2-git-send-email-jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-30 08:42:20 +01:00
Ingo Molnar	ae7f6711d6	Merge branch 'perf/urgent' into perf/core Merge reason: We want to queue up a dependent patch. Also update to later -rc's. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-29 10:36:22 +01:00
John Stultz	7e1b584774	ntp: Cleanup xtime references in ntp.c ntp.c doesn't need to access timekeeping internals directly, so change xtime references to use the get_seconds() timekeeping interface. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: richard@rsk.demon.co.uk LKML-Reference: <1264738844-21935-1-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-01-29 10:15:19 +01:00
john stultz	1f5b8f8a20	ntp: Make time_esterror and time_maxerror static Make time_esterror and time_maxerror static as no one uses them outside of ntp.c Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: richard@rsk.demon.co.uk LKML-Reference: <1264719761.3437.47.camel@localhost.localdomain> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-01-29 10:15:19 +01:00
Peter Zijlstra	75c9f3284a	perf_events: Fix sample_period transfer on inherit One problem with frequency driven counters is that we cannot predict the rate at which they trigger, therefore we have to start them at period=1, this causes a ramp up effect. However, if we fail to propagate the stable state on fork each new child will have to ramp up again. This can lead to significant artifacts in sample data. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: eranian@google.com Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1264752266.4283.2121.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-29 09:15:26 +01:00
Xiao Guangrong	1e12a4a7a3	tracing/kprobe: Cleanup unused return value of tracing functions The return values of the kprobe's tracing functions are meaningless, lets remove these. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <4B60E9A3.2040505@cn.fujitsu.com> [fweisbec@gmail: whitespace fixes, drop useless void returns in end of functions] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-01-29 02:14:40 +01:00
Xiao Guangrong	430ad5a600	perf: Factorize trace events raw sample buffer operations Introduce ftrace_perf_buf_prepare() and ftrace_perf_buf_submit() to gather the common code that operates on raw events sampling buffer. This cleans up redundant code between regular trace events, syscall events and kprobe events. Changelog v1->v2: - Rename function name as per Masami and Frederic's suggestion - Add __kprobes for ftrace_perf_buf_prepare() and make ftrace_perf_buf_submit() inline as per Masami's suggestion - Export ftrace_perf_buf_prepare since modules will use it Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <4B60E92D.9000808@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-01-29 02:02:57 +01:00
Lai Jiangshan	ea2c68a08f	tracing: Simplify test for function_graph tracing start point In the function graph tracer, a calling function is to be traced only when it is enabled through the set_graph_function file, or when it is nested in an enabled function. Current code uses TSK_TRACE_FL_GRAPH to test whether it is nested or not. Looking at the code, we can get this: (trace->depth > 0) <==> (TSK_TRACE_FL_GRAPH is set) trace->depth is more explicit to tell that it is nested. So we use trace->depth directly and simplify the code. No functionality is changed. TSK_TRACE_FL_GRAPH is not removed yet, it is left for future usage. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B4DB0B6.7040607@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-01-29 01:05:12 +01:00
Mahesh Salgaonkar	b23ff0e933	hw_breakpoints: Release the bp slot if arch_validate_hwbkpt_settings() fails. On a given architecture, when hardware breakpoint registration fails due to un-supported access type (read/write/execute), we lose the bp slot since register_perf_hw_breakpoint() does not release the bp slot on failure. Hence, any subsequent hardware breakpoint registration starts failing with 'no space left on device' error. This patch introduces error handling in register_perf_hw_breakpoint() function and releases bp slot on error. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Maneesh Soni <maneesh@in.ibm.com> LKML-Reference: <20100121125516.GA32521@in.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-01-28 14:15:51 +01:00
Frans Pop	9d3cfc4c1d	sched: Correct printk whitespace in warning from cpu down task check Due to an incorrect line break the output currently contains tabs. Also remove trailing space. The actual output that logcheck sent me looked like this: Task events/1 (pid = 10) is on cpu 1^I^I^I^I(state = 1, flags = 84208040) After this patch it becomes: Task events/1 (pid = 10) is on cpu 1 (state = 1, flags = 84208040) Signed-off-by: Frans Pop <elendilplanet.nl> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <201001251456.34996.elendil@planet.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-28 06:59:55 +01:00
Peter Zijlstra	11854247e2	sched: Fix incorrect sanity check We moved to migrate on wakeup, which means that sleeping tasks could still be present on offline cpus. Amend the check to only test running tasks. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-28 06:59:51 +01:00
Peter Zijlstra	abd5071394	perf: Reimplement frequency driven sampling There was a bug in the old period code that caused intel_pmu_enable_all() or native_write_msr_safe() to show up quite high in the profiles. In staring at that code it made my head hurt, so I rewrote it in a hopefully simpler fashion. Its now fully symetric between tick and overflow driven adjustments and uses less data to boot. The only complication is that it basically wants to do a u128 division. The code approximates that in a rather simple truncate until it fits fashion, taking care to balance the terms while truncating. This version does not generate that sampling artefact. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-27 08:39:33 +01:00
Oleg Nesterov	48d5067417	lockdep: Fix check_usage_backwards() error message Lockdep has found the real bug, but the output doesn't look right to me: > ========================================================= > [ INFO: possible irq lock inversion dependency detected ] > 2.6.33-rc5 #77 > --------------------------------------------------------- > emacs/1609 just changed the state of lock: > (&(&tty->ctrl_lock)->rlock){+.....}, at: [<ffffffff8127c648>] tty_fasync+0xe8/0x190 > but this lock took another, HARDIRQ-unsafe lock in the past: > (&(&sighand->siglock)->rlock){-.....} "HARDIRQ-unsafe" and "this lock took another" looks wrong, afaics. > ... key at: [<ffffffff81c054a4>] __key.46539+0x0/0x8 > ... acquired at: > [<ffffffff81089af6>] __lock_acquire+0x1056/0x15a0 > [<ffffffff8108a0df>] lock_acquire+0x9f/0x120 > [<ffffffff81423012>] _raw_spin_lock_irqsave+0x52/0x90 > [<ffffffff8127c1be>] __proc_set_tty+0x3e/0x150 > [<ffffffff8127e01d>] tty_open+0x51d/0x5e0 The stack-trace shows that this lock (ctrl_lock) was taken under ->siglock (which is hopefully irq-safe). This is a clear typo in check_usage_backwards() where we tell the print a fancy routine we're forwards. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100126181641.GA10460@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-27 08:34:02 +01:00
Mike Frysinger	0368897034	tracing/documentation: Cover new frame pointer semantics Update the graph tracer examples to cover the new frame pointer semantics (in terms of passing it along). Move the HAVE_FUNCTION_GRAPH_FP_TEST docs out of the Kconfig, into the right place, and expand on the details. Signed-off-by: Mike Frysinger <vapier@gentoo.org> LKML-Reference: <1264165967-18938-1-git-send-email-vapier@gentoo.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-26 17:00:39 -05:00
Steven Rostedt	3c05d74827	ring-buffer: Check for end of page in iterator If the iterator comes to an empty page for some reason, or if the page is emptied by a consuming read. The iterator code currently does not check if the iterator is pass the contents, and may return a false entry. This patch adds a check to the ring buffer iterator to test if the current page has been completely read and sets the iterator to the next page if necessary. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-26 16:14:08 -05:00
Steven Rostedt	492a74f421	ring-buffer: Check if ring buffer iterator has stale data Usually reads of the ring buffer is performed by a single task. There are two types of reads from the ring buffer. One is a consuming read which will consume the entry that was read and the next read will be the entry that follows. The other is an iterator that will let the user read the contents of the ring buffer without modifying it. When an iterator is allocated, writes to the ring buffer are disabled to protect the iterator. The problem exists when consuming reads happen while an iterator is allocated. Specifically, the kind of read that swaps out an entire page (used by splice) and replaces it with a new read. If the iterator is on the page that is swapped out, then the next read may read from this swapped out page and return garbage. This patch adds a check when reading the iterator to make sure that the iterator contents are still valid. If a consuming read has taken place, the iterator is reset. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-26 16:09:30 -05:00
Thomas Gleixner	7b7422a566	clocksource: Prevent potential kgdb dead lock commit `0f8e8ef7` (clocksource: Simplify clocksource watchdog resume logic) introduced a potential kgdb dead lock. When the kernel is stopped by kgdb inside code which holds watchdog_lock then kgdb dead locks in clocksource_resume_watchdog(). clocksource_resume_watchdog() is called from kbdg via clocksource_touch_watchdog() to avoid that the clock source watchdog marks TSC unstable after the kernel has been stopped. Solve this by replacing spin_lock with a spin_trylock and just return in case the lock is held. Not resetting the watchdog might result in TSC becoming marked unstable, but that's an acceptable penalty for using kgdb. The timekeeping is anyway easily screwed up by kgdb when the system uses either jiffies or a clock source which wraps in short intervals (e.g. pm_timer wraps about every 4.6s), so we really do not have to worry about that occasional TSC marked unstable side effect. The second caller of clocksource_resume_watchdog() is clocksource_resume(). The trylock is safe here as well because the system is UP at this point, interrupts are disabled and nothing else can hold watchdog_lock(). Reported-by: Jason Wessel <jason.wessel@windriver.com> LKML-Reference: <1264480000-6997-4-git-send-email-jason.wessel@windriver.com> Cc: kgdb-bugreport@lists.sourceforge.net Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-01-26 14:53:16 +01:00
Steven Rostedt	74bf4076f2	tracing: Prevent kernel oops with corrupted buffer If the contents of the ftrace ring buffer gets corrupted and the trace file is read, it could create a kernel oops (usualy just killing the user task thread). This is caused by the checking of the pid in the buffer. If the pid is negative, it still references the cmdline cache array, which could point to an invalid address. The simple fix is to test for negative PIDs. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-25 15:11:53 -05:00
Linus Torvalds	f6760aa024	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clockevent: Don't remove broadcast device when cpu is dead	2010-01-24 10:38:07 -08:00
Linus Torvalds	b8be634e01	Merge git://git.infradead.org/~dwmw2/mtd-2.6.33 * git://git.infradead.org/~dwmw2/mtd-2.6.33: mtd: tests: fix read, speed and stress tests on NOR flash mtd: Really add ARM pismo support kmsg_dump: Dump on crash_kexec as well	2010-01-24 10:31:34 -08:00
Thomas Gleixner	60db48cacb	sched: Queue a deboosted task to the head of the RT prio queue rtmutex_set_prio() is used to implement priority inheritance for futexes. When a task is deboosted it gets enqueued at the tail of its RT priority list. This is violating the POSIX scheduling semantics: rt priority list X contains two runnable tasks A and B task A runs with priority X and holds mutex M task C preempts A and is blocked on mutex M -> task A is boosted to priority of task C (Y) task A unlocks the mutex M and deboosts itself -> A is dequeued from rt priority list Y -> A is enqueued to the tail of rt priority list X task C schedules away task B runs This is wrong as task A did not schedule away and therefor violates the POSIX scheduling semantics. Enqueue the task to the head of the priority list instead. Reported-by: Mathias Weber <mathias.weber.mw1@roche.com> Reported-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Carsten Emde <cbe@osadl.org> Tested-by: Mathias Weber <mathias.weber.mw1@roche.com> LKML-Reference: <20100120171629.809074113@linutronix.de>	2010-01-22 18:09:59 +01:00
Thomas Gleixner	37dad3fce9	sched: Implement head queueing for sched_rt The ability of enqueueing a task to the head of a SCHED_FIFO priority list is required to fix some violations of POSIX scheduling policy. Implement the functionality in sched_rt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Carsten Emde <cbe@osadl.org> Tested-by: Mathias Weber <mathias.weber.mw1@roche.com> LKML-Reference: <20100120171629.772169931@linutronix.de>	2010-01-22 18:09:59 +01:00
Thomas Gleixner	ea87bb7853	sched: Extend enqueue_task to allow head queueing The ability of enqueueing a task to the head of a SCHED_FIFO priority list is required to fix some violations of POSIX scheduling policy. Extend the related functions with a "head" argument. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Carsten Emde <cbe@osadl.org> Tested-by: Mathias Weber <mathias.weber.mw1@roche.com> LKML-Reference: <20100120171629.734886007@linutronix.de>	2010-01-22 18:09:59 +01:00
Peter Zijlstra	fabf318e5e	sched: Fix fork vs hotplug vs cpuset namespaces There are a number of issues: 1) TASK_WAKING vs cgroup_clone (cpusets) copy_process(): sched_fork() child->state = TASK_WAKING; /* waiting for wake_up_new_task() */ if (current->nsproxy != p->nsproxy) ns_cgroup_clone() cgroup_clone() mutex_lock(inode->i_mutex) mutex_lock(cgroup_mutex) cgroup_attach_task() ss->can_attach() ss->attach() [ -> cpuset_attach() ] cpuset_attach_task() set_cpus_allowed_ptr(); while (child->state == TASK_WAKING) cpu_relax(); will deadlock the system. 2) cgroup_clone (cpusets) vs copy_process So even if the above would work we still have: copy_process(): if (current->nsproxy != p->nsproxy) ns_cgroup_clone() cgroup_clone() mutex_lock(inode->i_mutex) mutex_lock(cgroup_mutex) cgroup_attach_task() ss->can_attach() ss->attach() [ -> cpuset_attach() ] cpuset_attach_task() set_cpus_allowed_ptr(); ... p->cpus_allowed = current->cpus_allowed over-writing the modified cpus_allowed. 3) fork() vs hotplug if we unplug the child's cpu after the sanity check when the child gets attached to the task_list but before wake_up_new_task() shit will meet with fan. Solve all these issues by moving fork cpu selection into wake_up_new_task(). Reported-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1264106190.4283.1314.camel@laptop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-01-21 23:25:31 +01:00
Linus Torvalds	e80b135985	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: x86: Add support for the ANY bit perf: Change the is_software_event() definition perf: Honour event state for aux stream data perf: Fix perf_event_do_pending() fallback callsite perf kmem: Print usage help for unknown commands perf kmem: Increase "Hit" column length hw-breakpoints, perf: Fix broken mmiotrace due to dr6 by reference change perf timechart: Use tid not pid for COMM change	2010-01-21 08:50:04 -08:00
Peter Zijlstra	22e190851f	perf: Honour event state for aux stream data Anton reported that perf record kept receiving events even after calling ioctl(PERF_EVENT_IOC_DISABLE). It turns out that FORK,COMM and MMAP events didn't respect the disabled state and kept flowing in. Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Anton Blanchard <anton@samba.org> LKML-Reference: <1263459187.4244.265.camel@laptop> CC: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:40 +01:00
Peter Zijlstra	fe432200ab	perf: Fix perf_event_do_pending() fallback callsite Paul questioned the context in which we should call perf_event_do_pending(). After looking at that I found that it should be called from IRQ context these days, however the fallback call-site is placed in softirq context. Ammend this by placing the callback in the IRQ timer path. Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1263374859.4244.192.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:39 +01:00
Dhaval Giani	7c9414385e	sched: Remove USER_SCHED Remove the USER_SCHED feature. It has been scheduled to be removed in 2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2 Signed-off-by: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1263990378.24844.3.camel@localhost> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:18 +01:00
Gautham R Shenoy	871e35bc97	sched: Fix the place where group powers are updated We want to update the sched_group_powers when balance_cpu == this_cpu. Currently the group powers are updated only if the balance_cpu is the first CPU in the local group. But balance_cpu = this_cpu could also be the first idle cpu in the group. Hence fix the place where the group powers are updated. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1264017764.5717.127.camel@jschopp-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:17 +01:00
Peter Zijlstra	8f190fb3f7	sched: Assume *balance is valid Since all load_balance() callers will have !NULL balance parameters we can now assume so and remove a few checks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:15 +01:00
Peter Zijlstra	f492e12ef0	sched: Remove load_balance_newidle() The two functions: load_balance{,_newidle}() are very similar, with the following differences: - rq->lock usage - sb->balance_interval updates - *balance check So remove the load_balance_newidle() call with load_balance(.idle = CPU_NEWLY_IDLE), explicitly unlock the rq->lock before calling (would be done by double_lock_balance() anyway), and ignore the other differences for now. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:14 +01:00
Peter Zijlstra	1af3ed3ddf	sched: Unify load_balance{,_newidle}() load_balance() and load_balance_newidle() look remarkably similar, one key point they differ in is the condition on when to active balance. So split out that logic into a separate function. One side effect is that previously load_balance_newidle() used to fail and return -1 under these conditions, whereas now it doesn't. I've not yet fully figured out the whole -1 return case for either load_balance{,_newidle}(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:13 +01:00
Peter Zijlstra	baa8c1102f	sched: Add a lock break for PREEMPT=y Since load-balancing can hold rq->locks for quite a long while, allow breaking out early when there is lock contention. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:13 +01:00
Peter Zijlstra	230059de77	sched: Remove from fwd decls Move code around to get rid of fwd declarations. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:12 +01:00
Peter Zijlstra	897c395f4c	sched: Remove rq_iterator from move_one_task Again, since we only iterate the fair class, remove the abstraction. Since this is the last user of the rq_iterator, remove all that too. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:11 +01:00
Peter Zijlstra	ee00e66fff	sched: Remove rq_iterator usage from load_balance_fair Since we only ever iterate the fair class, do away with this abstraction. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:10 +01:00
Peter Zijlstra	3d45fd804a	sched: Remove the sched_class load_balance methods Take out the sched_class methods for load-balancing. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:09 +01:00
Peter Zijlstra	1e3c88bdeb	sched: Move load balance code into sched_fair.c Straight fwd code movement. Since non of the load-balance abstractions are used anymore, do away with them and simplify the code some. In preparation move the code around. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:08 +01:00
Yong Zhang	6d558c3ac9	sched: Reassign prev and switch_count when reacquire_kernel_lock() fail Assume A->B schedule is processing, if B have acquired BKL before and it need reschedule this time. Then on B's context, it will go to need_resched_nonpreemptible for reschedule. But at this time, prev and switch_count are related to A. It's wrong and will lead to incorrect scheduler statistics. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <2674af741001102238w7b0ddcadref00d345e2181d11@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:39:04 +01:00
Mike Galbraith	50b926e439	sched: Fix vmark regression on big machines SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't enabled, leading to many cache misses on large machines as we traverse looking for an idle shared cache to wake to. Change the enabler of select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the sibling domain level. Reported-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1262612696.15495.15.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:39:03 +01:00
Xiaotian Feng	ea9d8e3f45	clockevent: Don't remove broadcast device when cpu is dead Marc reported that the BUG_ON in clockevents_notify() triggers on his system. This happens because the kernel tries to remove an active clock event device (used for broadcasting) from the device list. The handling of devices which can be used as per cpu device and as a global broadcast device is suboptimal. The simplest solution for now (and for stable) is to check whether the device is used as global broadcast device, but this needs to be revisited. [ tglx: restored the cpuweight check and massaged the changelog ] Reported-by: Marc Dionne <marc.c.dionne@gmail.com> Tested-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: Xiaotian Feng <dfeng@redhat.com> LKML-Reference: <1262834564-13033-1-git-send-email-dfeng@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org	2010-01-18 14:44:50 +01:00
Milton Miller	e03bcb6862	generic-ipi: Optimize accesses by using DEFINE_PER_CPU_SHARED_ALIGNED for IPI data The smp ipi data is passed around and given write access by other cpus and should be separated from per-cpu data consumed by this cpu. Looking for hot lines, I saw call_function_data shared with tick_cpu_sched. Signed-off-by: Milton Miller <miltonm@bga.com> Acked-by: Anton Blanchard <anton@samba.org> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: : Nick Piggin <npiggin@suse.de> LKML-Reference: <20100118020051.GR12666@kryten> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-18 09:02:59 +01:00
Ingo Molnar	f426a7e029	Merge branch 'perf/scheduling' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core	2010-01-18 08:56:41 +01:00
James Morris	2457552d1e	Merge branch 'master' into next	2010-01-18 09:56:22 +11:00
Frederic Weisbecker	329c0e012b	perf: Better order flexible and pinned scheduling When a task gets scheduled in. We don't touch the cpu bound events so the priority order becomes: cpu pinned, cpu flexible, task pinned, task flexible. So schedule out cpu flexibles when a new task context gets in and correctly order the groups to schedule in: task pinned, cpu flexible, task flexible. Cpu pinned groups don't need to be touched at this time. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-17 13:11:05 +01:00
Frederic Weisbecker	7defb0f879	perf: Don't schedule out/in pinned events on task tick We don't need to schedule in/out pinned events on task tick, now that pinned and flexible groups can be scheduled separately. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-17 13:09:51 +01:00
Frederic Weisbecker	5b0311e1f2	perf: Allow pinned and flexible groups to be scheduled separately Tune the scheduling helpers so that we can choose to schedule either pinned and/or flexible groups from a context. And while at it, refactor a bit the naming of these helpers to make these more consistent and flexible. There is no (intended) change in scheduling behaviour in this patch. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-17 13:08:57 +01:00
Frederic Weisbecker	42cce92f4d	perf: Make __perf_event_sched_out static __perf_event_sched_out doesn't need to be globally available, make it static. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-17 13:08:01 +01:00
Masami Hiramatsu	231e36f4d2	tracing/kprobe: Update kprobe tracing self test for new syntax Update kprobe tracing self test for new syntax (it supports deleting individual probes, and drops $argN support) and behavior change (new probes are disabled in default). This selftest includes the following checks: - Adding function-entry probe and return probe with arguments. - Enabling these probes. - Deleting it individually. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100114051211.7814.29436.stgit@localhost6.localdomain6> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-17 08:15:35 +01:00
H Hartley Sweeten	6d686f4564	sched: Don't expose local functions kernel/sched: don't expose local functions The get_rr_interval_* functions are all class methods of struct sched_class. They are not exported so make them static. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <201001132021.53253.hartleys@visionengravers.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-17 08:09:45 +01:00
Frederic Weisbecker	24a53652e3	tracing: Drop the tr check from the graph tracing path Each time we save a function entry from the function graph tracer, we check if the trace array is set, which is wasteful because it is set anyway before we start the tracer. All we need is to ensure we have good read and write orderings. When we set the trace array, we just need to guarantee it to be visible before starting tracing. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> LKML-Reference: <1263453795-7496-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-17 08:06:25 +01:00
Linus Torvalds	2a8249daf6	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futexes: Remove rw parameter from get_futex_key()	2010-01-16 12:31:30 -08:00
Linus Torvalds	6ccc347b69	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing/filters: Add comment for match callbacks tracing/filters: Fix MATCH_FULL filter matching for PTR_STRING tracing/filters: Fix MATCH_MIDDLE_ONLY filter matching lib: Introduce strnstr() tracing/filters: Fix MATCH_END_ONLY filter matching tracing/filters: Fix MATCH_FRONT_ONLY filter matching ftrace: Fix MATCH_END_ONLY function filter tracing/x86: Derive arch from bits argument in recordmcount.pl ring-buffer: Add rb_list_head() wrapper around new reader page next field ring-buffer: Wrap a list.next reference with rb_list_head()	2010-01-16 12:27:25 -08:00
David John	af2422c42c	smp_call_function_any(): pass the node value to cpumask_of_node() The change in acpi_cpufreq to use smp_call_function_any causes a warning when it is called since the function erroneously passes the cpu id to cpumask_of_node rather than the node that the cpu is on. Fix this. cpumask_of_node(3): node > nr_node_ids(1) Pid: 1, comm: swapper Not tainted 2.6.33-rc3-00097-g2c1f189 #223 Call Trace: [<ffffffff81028bb3>] cpumask_of_node+0x23/0x58 [<ffffffff81061f51>] smp_call_function_any+0x65/0xfa [<ffffffff810160d1>] ? do_drv_read+0x0/0x2f [<ffffffff81015fba>] get_cur_val+0xb0/0x102 [<ffffffff81016080>] get_cur_freq_on_cpu+0x74/0xc5 [<ffffffff810168a7>] acpi_cpufreq_cpu_init+0x417/0x515 [<ffffffff81562ce9>] ? __down_write+0xb/0xd [<ffffffff8148055e>] cpufreq_add_dev+0x278/0x922 Signed-off-by: David John <davidjon@xenontk.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-16 12:15:39 -08:00
Andi Kleen	5dab600e6a	kfifo: document everywhere that size has to be power of two On my first try using them I missed that the fifos need to be power of two, resulting in a runtime bug. Document that requirement everywhere (and fix one grammar bug) Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Stefani Seibold <stefani@seibold.net> Cc: Roland Dreier <rdreier@cisco.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Walls <awalls@radix.net> Cc: Vikram Dhillon <dhillonv10@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-16 12:15:38 -08:00
Andi Kleen	a5b9e2c106	kfifo: add kfifo_out_peek In some upcoming code it's useful to peek into a FIFO without permanentely removing data. This patch implements a new kfifo_out_peek() to do this. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Stefani Seibold <stefani@seibold.net> Cc: Roland Dreier <rdreier@cisco.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Walls <awalls@radix.net> Cc: Vikram Dhillon <dhillonv10@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-16 12:15:38 -08:00
Andi Kleen	64ce1037c5	kfifo: sanitize _user error handling Right now for kfifo__user it's not easily possible to distingush between a user copy failing and the FIFO not containing enough data. The problem is that both conditions are multiplexed into the same return code. Avoid this by moving the "copy length" into a separate output parameter and only return 0/-EFAULT in the main return value. I didn't fully adapt the weird "record" variants, those seem to be unused anyways and were rather messy (should they be just removed?) I would appreciate some double checking if I did all the conversions correctly. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: Roland Dreier <rdreier@cisco.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Walls <awalls@radix.net> Cc: Vikram Dhillon <dhillonv10@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-16 12:15:38 -08:00
Andi Kleen	8ecc295153	kfifo: use void * pointers for user buffers The pointers to user buffers are currently unsigned char , which requires a lot of casting in the caller for any non-char typed buffers. Use void instead. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Stefani Seibold <stefani@seibold.net> Cc: Roland Dreier <rdreier@cisco.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Walls <awalls@radix.net> Cc: Vikram Dhillon <dhillonv10@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-16 12:15:38 -08:00
Frederic Weisbecker	d6f962b57b	perf: Export software-only event group characteristic as a flag Before scheduling an event group, we first check if a group can go on. We first check if the group is made of software only events first, in which case it is enough to know if the group can be scheduled in. For that purpose, we iterate through the whole group, which is wasteful as we could do this check when we add/delete an event to a group. So we create a group_flags field in perf event that can host characteristics from a group of events, starting with a first PERF_GROUP_SOFTWARE flag that reduces the check on the fast path. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-16 12:30:40 +01:00
Frederic Weisbecker	e286417378	perf: Round robin flexible groups of events using list_rotate_left() This is more proper that doing it through a list_for_each_entry() that breaks after the first entry. v2: Don't rotate pinned groups as its not needed to time share them. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-16 12:30:28 +01:00
Frederic Weisbecker	889ff01506	perf/core: Split context's event group list into pinned and non-pinned lists Split-up struct perf_event_context::group_list into pinned_groups and flexible_groups (non-pinned). This first appears to be useless as it duplicates various loops around the group list handlings. But it scales better in the fast-path in perf_sched_in(). We don't anymore iterate twice through the entire list to separate pinned and non-pinned scheduling. Instead we interate through two distinct lists. The another desired effect is that it makes easier to define distinct scheduling rules on both. Changes in v2: - Respectively rename pinned_grp_list and volatile_grp_list into pinned_groups and flexible_groups as per Ingo suggestion. - Various cleanups Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-16 12:27:42 +01:00
Paul E. McKenney	017c426138	rcu: Fix sparse warnings Rename local variable "i" in rcu_init() to avoid conflict with RCU_INIT_FLAVOR(), restrict the scope of RCU_TREE_NONCORE, and make __synchronize_srcu() static. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12635142581560-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-16 10:25:22 +01:00
Li Zefan	d1303dd1d6	tracing/filters: Add comment for match callbacks We should be clear on 2 things: - the length parameter of a match callback includes tailing '\0'. - the string to be searched might not be NULL-terminated. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8770.7000608@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-14 22:38:14 -05:00
Li Zefan	16da27a8bc	tracing/filters: Fix MATCH_FULL filter matching for PTR_STRING MATCH_FULL matching for PTR_STRING is not working correctly: # echo 'func == vt' > events/bkl/lock_kernel/filter # echo 1 > events/bkl/lock_kernel/enable ... # cat trace Xorg-1484 [000] 1973.392586: lock_kernel: ... func=vt_ioctl() gpm-1402 [001] 1974.027740: lock_kernel: ... func=vt_ioctl() We should pass to regex.match(..., len) the length (including '\0') of the source string instead of the length of the pattern string. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8763.5070707@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-14 22:38:12 -05:00
Li Zefan	b2af211f28	tracing/filters: Fix MATCH_MIDDLE_ONLY filter matching The @str might not be NULL-terminated if it's of type DYN_STRING or STATIC_STRING, so we should use strnstr() instead of strstr(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8753.2000102@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-14 22:38:11 -05:00
Li Zefan	a3291c14ec	tracing/filters: Fix MATCH_END_ONLY filter matching For '*foo' pattern, we should allow any string ending with 'foo', but event filtering incorrectly disallows strings like bar_foo_foo: Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8735.6070604@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-14 22:38:07 -05:00
Li Zefan	285caad415	tracing/filters: Fix MATCH_FRONT_ONLY filter matching MATCH_FRONT_ONLY actually is a full matching: # ./perf record -R -f -a -e lock:lock_acquire \ --filter 'name ~rcu_*' sleep 1 # ./perf trace (no output) We should pass the length of the pattern string to strncmp(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8721.5090301@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-14 22:38:05 -05:00
Li Zefan	751e9983ee	ftrace: Fix MATCH_END_ONLY function filter For 'foo' pattern, we should allow any string ending with 'foo', but ftrace filter incorrectly disallows strings like bar_foo_foo: # echo 'io' > set_ftrace_filter # cat set_ftrace_filter \| grep 'req_bio_endio' # cat available_filter_functions \| grep 'req_bio_endio' req_bio_endio Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E870E.6060607@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-14 22:38:03 -05:00
Jamie Iles	8381f65d09	sched/perf: Make sure irqs are disabled for perf_event_task_sched_in() perf_event_task_sched_in() expects interrupts to be disabled, but on architectures with __ARCH_WANT_INTERRUPTS_ON_CTXSW defined, this isn't true. If this is defined, disable irqs around the call in finish_task_switch(). Signed-off-by: Jamie Iles <jamie.iles@picochip.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> LKML-Reference: <1262964453-27370-1-git-send-email-jamie.iles@picochip.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 10:43:08 +01:00
Masami Hiramatsu	14640106f2	tracing/kprobe: Drop function argument access syntax Drop function argument access syntax, because the function arguments depend on not only architecture but also compile-options and function API. And now, we have perf-probe for finding register/memory assigned to each argument. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Neuling <mikey@neuling.org> Cc: linuxppc-dev@ozlabs.org LKML-Reference: <20100105224648.19431.52309.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 10:09:12 +01:00
Ingo Molnar	61405fea92	Merge branch 'perf/urgent' into perf/core Merge reason: queue up dependent patch, update to -rc4 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 10:08:50 +01:00
KOSAKI Motohiro	7485d0d375	futexes: Remove rw parameter from get_futex_key() Currently, futexes have two problem: A) The current futex code doesn't handle private file mappings properly. get_futex_key() uses PageAnon() to distinguish file and anon, which can cause the following bad scenario: 1) thread-A call futex(private-mapping, FUTEX_WAIT), it sleeps on file mapping object. 2) thread-B writes a variable and it makes it cow. 3) thread-B calls futex(private-mapping, FUTEX_WAKE), it wakes up blocked thread on the anonymous page. (but it's nothing) B) Current futex code doesn't handle zero page properly. Read mode get_user_pages() can return zero page, but current futex code doesn't handle it at all. Then, zero page makes infinite loop internally. The solution is to use write mode get_user_page() always for page lookup. It prevents the lookup of both file page of private mappings and zero page. Performance concerns: Probaly very little, because glibc always initialize variables for futex before to call futex(). It means glibc users never see the overhead of this patch. Compatibility concerns: This patch has few compatibility issues. After this patch, FUTEX_WAIT require writable access to futex variables (read-only mappings makes EFAULT). But practically it's not a problem, glibc always initalizes variables for futexes explicitly - nobody uses read-only mappings. Reported-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Darren Hart <dvhltc@us.ibm.com> Cc: <stable@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Ulrich Drepper <drepper@gmail.com> LKML-Reference: <20100105162633.45A2.A69D9226@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:17:36 +01:00
Paul E. McKenney	b6407e8639	rcu: Give different levels of the rcu_node hierarchy distinct lockdep names Previously, each level of the rcu_node hierarchy had the same rather unimaginative name: "&rcu_node_class[i]". This makes lockdep diagnostics involving these lockdep classes less helpful than would be nice. This patch fixes this by giving each level of the rcu_node hierarchy a distinct name: "rcu_node_level_0", "rcu_node_level_1", and so on. This version of the patch includes improved diagnostics suggested by Josh Triplett and Peter Zijlstra. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626498421830-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:07 +01:00
Paul E. McKenney	cba8244a0f	rcu: Add debug check for too many rcu_read_unlock() TREE_PREEMPT_RCU maintains an rcu_read_lock_nesting counter in the task structure, which happens to be a signed int. So this patch adds a check for this counter being negative at the end of __rcu_read_unlock(). This check is under CONFIG_PROVE_LOCKING, so can be thought of as being part of lockdep. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626498423064-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:06 +01:00
Paul E. McKenney	bf66f18e79	rcu: Add force_quiescent_state() testing to rcutorture Add force_quiescent_state() testing to rcutorture, with a separate thread that repeatedly invokes force_quiescent_state() in bursts. This can greatly increase the probability of encountering certain types of race conditions. Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1262646551116-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:05 +01:00
Paul E. McKenney	46a1e34eda	rcu: Make force_quiescent_state() start grace period if needed Grace periods cannot be started while force_quiescent_state() is active. This is OK in that the affected CPUs will try again later, but it does induce needless grace-period delays. This patch causes rcu_start_gp() to record a failed attempt to start a grace period. When force_quiescent_state() prepares to return, it then starts the grace period if there was such a failed attempt. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465501854-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:05 +01:00
Paul E. McKenney	45f014c52e	rcu: Remove redundant grace-period check The rcu_process_dyntick() function checks twice for the end of the current grace period. However, it holds the current rcu_node structure's ->lock field throughout, and doesn't get to the second call to rcu_gp_in_progress() unless there is at least one CPU corresponding to this rcu_node structure that has not yet checked in for the current grace period, which would prevent the current grace period from ending. So the current grace period cannot have ended, and the second check is redundant, so remove it. Also, given that this function is used even with !CONFIG_NO_HZ, its name is quite misleading. Change from rcu_process_dyntick() to force_qs_rnp(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1262646550562-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:04 +01:00
Paul E. McKenney	ee47eb9f4d	rcu: Remove leg of force_quiescent_state() switch statement The comparisons of rsp->gpnum nad rsp->completed in rcu_process_dyntick() and force_quiescent_state() can be replaced by the much more clear rcu_gp_in_progress() predicate function. After doing this, it becomes clear that the RCU_SAVE_COMPLETED leg of the force_quiescent_state() function's switch statement is almost completely a no-op. A small change to the RCU_SAVE_DYNTICK leg renders it a complete no-op, after which it can be removed. Doing so also eliminates the forcenow local variable from force_quiescent_state(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465501781-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:04 +01:00
Paul E. McKenney	0f10dc8266	rcu: Eliminate rcu_process_dyntick() return value Because a new grace period cannot start while we are executing within the force_quiescent_state() function's switch statement, if any test within that switch statement or within any function called from that switch statement shows that the current grace period has ended, we can safely re-do that test any time before we leave the switch statement. This means that we no longer need a return value from rcu_process_dyntick(), as we can simply invoke rcu_gp_in_progress() to check whether the old grace period has finished -- there is no longer any need to worry about whether or not a new grace period has been started. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465501857-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:03 +01:00
Paul E. McKenney	eb1ba45f1e	rcu: Eliminate second argument of rcu_process_dyntick() At this point, the second argument to all calls to rcu_process_dyntick() is a function of the same field of the structure passed in as the first argument, namely, rsp->gpnum-1. So propagate rsp->gpnum-1 to all uses of the second argument within rcu_process_dyntick() and then eliminate the second argument. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465503786-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:03 +01:00
Paul E. McKenney	39c0bbfc07	rcu: Eliminate local variable lastcomp from force_quiescent_state() Because rsp->fqs_active is set to 1 across force_quiescent_state()'s switch statement, rcu_start_gp() will refrain from starting a new grace period during this time. Therefore, rsp->gpnum is constant, and can be propagated to all uses of lastcomp, eliminating this local variable. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465502985-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:03 +01:00
Paul E. McKenney	f3a8b5c6aa	rcu: Eliminate local variable signaled from force_quiescent_state() Because the root rcu_node lock is held across entry to the switch statement in force_quiescent_state(), it is no longer necessary to snapshot rsp->signaled to a local variable. Eliminate both the snapshotting and the local variable. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1262646550602-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:02 +01:00
Paul E. McKenney	07079d5357	rcu: Prohibit starting new grace periods while forcing quiescent states Reduce the number and variety of race conditions by prohibiting the start of a new grace period while force_quiescent_state() is active. A new fqs_active flag in the rcu_state structure is used to trace whether or not force_quiescent_state() is active, and this new flag is tested by rcu_start_gp(). If the CPU that closed out the last grace period needs another grace period, this new grace period may be delayed up to one scheduling-clock tick, but it will eventually get started. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <126264655052-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:02 +01:00
Paul E. McKenney	559569acf9	rcu: Adjust force_quiescent_state() locking, step 2 This patch releases rnp->lock after the end of force_quiescent_state()'s switch statement. This is a second step towards prohibiting starting grace periods while force_quiescent_state() is executing, which will reduce the number and complexity of races that force_quiescent_state() is involved in. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465501994-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:01 +01:00
Paul E. McKenney	f96e9232e0	rcu: Adjust force_quiescent_state() locking, step 1 This causes rnp->lock to be held on entry to force_quiescent_state()'s switch statement. This is a first step towards prohibiting starting grace periods while force_quiescent_state() is executing, which will reduce the number and complexity of races that force_quiescent_state() is involved in. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465501455-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:01 +01:00
Andi Kleen	b45c6e76bc	kernel/signal.c: fix kernel information leak with print-fatal-signals=1 When print-fatal-signals is enabled it's possible to dump any memory reachable by the kernel to the log by simply jumping to that address from user space. Or crash the system if there's some hardware with read side effects. The fatal signals handler will dump 16 bytes at the execution address, which is fully controlled by ring 3. In addition when something jumps to a unmapped address there will be up to 16 additional useless page faults, which might be potentially slow (and at least is not very efficient) Fortunately this option is off by default and only there on i386. But fix it by checking for kernel addresses and also stopping when there's a page fault. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-11 09:34:05 -08:00
Dave Anderson	bd4f490a07	cgroups: fix 2.6.32 regression causing BUG_ON() in cgroup_diput() The LTP cgroup test suite generates a "kernel BUG at kernel/cgroup.c:790!" here in cgroup_diput(): /* * if we're getting rid of the cgroup, refcount should ensure * that there are no pidlists left. */ BUG_ON(!list_empty(&cgrp->pidlists)); The cgroup pidlist rework in 2.6.32 generates the BUG_ON, which is caused when pidlist_array_load() calls cgroup_pidlist_find(): (1) if a matching cgroup_pidlist is found, it down_write's the mutex of the pre-existing cgroup_pidlist, and increments its use_count. (2) if no matching cgroup_pidlist is found, then a new one is allocated, it down_write's its mutex, and the use_count is set to 0. (3) the matching, or new, cgroup_pidlist gets returned back to pidlist_array_load(), which increments its use_count -- regardless whether new or pre-existing -- and up_write's the mutex. So if a matching list is ever encountered by cgroup_pidlist_find() during the life of a cgroup directory, it results in an inflated use_count value, preventing it from ever getting released by cgroup_release_pid_array(). Then if the directory is subsequently removed, cgroup_diput() hits the BUG_ON() when it finds that the directory's cgroup is still populated with a pidlist. The patch simply removes the use_count increment when a matching pidlist is found by cgroup_pidlist_find(), because it gets bumped by the calling pidlist_array_load() function while still protected by the list's mutex. Signed-off-by: Dave Anderson <anderson@redhat.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Ben Blum <bblum@andrew.cmu.edu> Cc: Paul Menage <menage@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-11 09:34:05 -08:00
Masami Hiramatsu	8767ba2796	kmod: fix resource leak in call_usermodehelper_pipe() Fix resource (write-pipe file) leak in call_usermodehelper_pipe(). When call_usermodehelper_exec() fails, write-pipe file is opened and call_usermodehelper_pipe() just returns an error. Since it is hard for caller to determine whether the error occured when opening the pipe or executing the helper, the caller cannot close the pipe by themselves. I've found this resoruce leak when testing coredump. You can check how the resource leaks as below; $ echo "\|nocommand" > /proc/sys/kernel/core_pattern $ ulimit -c unlimited $ while [ 1 ]; do ./segv; done &> /dev/null & $ cat /proc/meminfo (<- repeat it) where segv.c is; //----- int main () { char p = 0; p = 1; } //----- This patch closes write-pipe file if call_usermodehelper_exec() failed. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-11 09:34:04 -08:00
Steven Rostedt	0e1ff5d72a	ring-buffer: Add rb_list_head() wrapper around new reader page next field If the very unlikely case happens where the writer moves the head by one between where the head page is read and where the new reader page is assigned _and_ the writer then writes and wraps the entire ring buffer so that the head page is back to what was originally read as the head page, the page to be swapped will have a corrupted next pointer. Simple solution is to wrap the assignment of the next pointer with a rb_list_head(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 20:40:44 -05:00
David Sharp	5ded3dc6a3	ring-buffer: Wrap a list.next reference with rb_list_head() This reference at the end of rb_get_reader_page() was causing off-by-one writes to the prev pointer of the page after the reader page when that page is the head page, and therefore the reader page has the RB_PAGE_HEAD flag in its list.next pointer. This eventually results in a GPF in a subsequent call to rb_set_head_page() (usually from rb_get_reader_page()) when that prev pointer is dereferenced. The dereferenced register would characteristically have an address that appears shifted left by one byte (eg, ffxxxxxxxxxxxxyy instead of ffffxxxxxxxxxxxx) due to being written at an address one byte too high. Signed-off-by: David Sharp <dhsharp@google.com> LKML-Reference: <1262826727-9090-1-git-send-email-dhsharp@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 20:38:25 -05:00
Steven Rostedt	d931369b74	tracing: Add stack dump to trace_printk if stacktrace option is set If the ftrace stacktrace option is set, then add the stack dumps to trace_printk. Requested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 18:09:57 -05:00
Lai Jiangshan	7e53bd42d1	tracing: Consolidate protection of reader access to the ring buffer At the beginning, access to the ring buffer was fully serialized by trace_types_lock. Patch `d7350c3f45` gives more freedom to readers, and patch `b04cc6b1f6` adds code to protect trace_pipe and cpu#/trace_pipe. But actually it is not enough, ring buffer readers are not always read-only, they may consume data. This patch makes accesses to trace, trace_pipe, trace_pipe_raw cpu#/trace, cpu#/trace_pipe and cpu#/trace_pipe_raw serialized. And removes tracing_reader_cpumask which is used to protect trace_pipe. Details: Ring buffer serializes readers, but it is low level protection. The validity of the events (which returns by ring_buffer_peek() ..etc) are not protected by ring buffer. The content of events may become garbage if we allow another process to consume these events concurrently: A) the page of the consumed events may become a normal page (not reader page) in ring buffer, and this page will be rewritten by the events producer. B) The page of the consumed events may become a page for splice_read, and this page will be returned to system. This patch adds trace_access_lock() and trace_access_unlock() primitives. These primitives allow multi process access to different cpu ring buffers concurrently. These primitives don't distinguish read-only and read-consume access. Multi read-only access is also serialized. And we don't use these primitives when we open files, we only use them when we read files. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B447D52.1050602@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 12:51:34 -05:00
Lai Jiangshan	0fa0edaf32	tracing: Remove show_format and related macros from TRACE_EVENT The previous patches added the use of print_fmt string and changes the trace_define_field() function to also create the fields and format output for the event format files. text data bss dec hex filename 5857201 1355780 9336808 16549789 fc879d vmlinux 5884589 1351684 9337896 16574169 fce6d9 vmlinux-orig The above shows the size of the vmlinux after this patch set compared to the vmlinux-orig which is before the patch set. This saves us 27k on text, 1k on bss and adds just 4k of data. The total savings of 24k in size. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D4D.40604@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 12:08:46 -05:00
Lai Jiangshan	5a65e95622	tracing: Use defined fields and print_fmt to print formats The calls ftrace_format_##call() and ftrace_define_fields_##call() are almost duplicate in functionality. With the addition of the print_fmt in previous patches, these two functions can be merged into one. The trace_define_field() defines the fields and links them into the struct ftrace_event_call. The previous patches introduced the print_fmt field and this can now be used with the trace_define_field() to create the event format file fields and print_fmt field. The struct ftrace_event_call->fields are used to print the fields The struct ftrace_event_call->print_fmt is used to print the "print fmt: XXXXXXXXXXX" line. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D49.5000006@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 12:08:20 -05:00
Steven Rostedt	c7ef3a9004	tracing: Have syscall tracing call its own init function In the clean up of having all events call one specific function, the syscall event init was changed to call this helper function. With the new print_fmt updates, the syscalls need to do special initializations. This patch converts the syscall events to call its own init function again. Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 12:02:32 -05:00
Lai Jiangshan	a342a0280b	tracing/kprobes: Init print_fmt for kprobe events This is part of a patch set that removes the show_format method in the ftrace event macros. Add the print_fmt initialization to the kprobe events. The print_fmt is still not used, but will be in the follow up patches. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D45.3080100@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 12:01:35 -05:00
Lai Jiangshan	50307a45f8	tracing/syscalls: Init print_fmt for syscall events This is part of a patch set that removes the show_format method in the ftrace event macros. Add the print_fmt initialization to the syscall events. The print_fmt is still not used, but will be in the follow up patches. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D41.609@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 11:58:32 -05:00
Lai Jiangshan	509e760cd9	tracing: Add print_fmt field This is part of a patch set that removes the show_format method in the ftrace event macros. The print_fmt field is added to hold the string that shows the print_fmt in the event format files. This patch only adds the field but it is currently not used. Later patches will use this field to enable us to remove the show_format field and function. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D3E.2000704@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 11:41:54 -05:00
Lai Jiangshan	809826a389	tracing: Have __dynamic_array() define a field This is part of a patch set that removes the show_format method in the ftrace event macros. This patch set requires that all fields are added to the ftrace_event_call->fields. This patch changes __dynamic_array() to call trace_define_field() to include fields that use __dynamic_array(). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D36.8090100@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 11:30:02 -05:00
Ben Hutchings	10b465aaf9	modules: Skip empty sections when exporting section notes Commit `35dead4` "modules: don't export section names of empty sections via sysfs" changed the set of sections that have attributes, but did not change the iteration over these attributes in add_notes_attrs(). This can lead to add_notes_attrs() creating attributes with the wrong names or with null name pointers. Introduce a sect_empty() function and use it in both add_sect_attrs() and add_notes_attrs(). Reported-by: Martin Michlmayr <tbm@cyrius.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Tested-by: Martin Michlmayr <tbm@cyrius.com> Cc: stable@kernel.org Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-06 01:11:29 -08:00
Steffen Klassert	16295bec63	padata: Generic parallelization/serialization interface This patch introduces an interface to process data objects in parallel. The parallelized objects return after serialization in the same order as they were before the parallelization. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-01-06 19:47:10 +11:00
Christoph Lameter	79615760f3	local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c ringbuffer*.c are the last users of local.h. Remove the include from modules.h and add it to ringbuffer files. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-01-05 15:34:50 +09:00
Christoph Lameter	e1783a240f	module: Use this_cpu_xx to dynamically allocate counters Use cpu ops to deal with the per cpu data instead of a local_t. Reduces memory requirements, cache footprint and decreases cycle counts. The this_cpu_xx operations are also used for !SMP mode. Otherwise we could not drop the use of __module_ref_addr() which would make per cpu data handling complicated. this_cpu_xx operations have their own fallback for !SMP. V8-V9: - Leave include asm/module.h since ringbuffer.c depends on it. Nothing else does though. Another patch will deal with that. - Remove spurious free. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-01-05 15:34:50 +09:00
Tejun Heo	32032df6c2	Merge branch 'master' into percpu Conflicts: arch/powerpc/platforms/pseries/hvCall.S include/linux/percpu.h	2010-01-05 09:17:33 +09:00
Linus Torvalds	952363c90c	Merge branch 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix NULL deref in inheritance code perf: Pass appropriate frame pointer to dump_trace()	2009-12-31 11:56:24 -08:00
Linus Torvalds	9d6e323c68	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf kmem: Fix statistics typo kprobes: Fix distinct type warning perf: Rename perf_event_hw_event in design document perf tools: Add missing header files to LIB_H Makefile variable perf record: We should fork only if a program was specified to run perf diff: Fix usage array, it must end with a NULL entry	2009-12-31 11:52:24 -08:00
Linus Torvalds	b21c070403	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix sign fields in ftrace_define_fields_##call() tracing/syscalls: Fix typo in SYSCALL_DEFINE0 tracing/kprobe: Show sign of fields in trace_kprobe format files ksym_tracer: Remove trace_stat ksym_tracer: Fix race when incrementing count ksym_tracer: Fix to allow writing newline to ksym_trace_filter ksym_tracer: Fix to make the tracer work tracing: Kconfig spelling fixes and cleanups tracing: Fix setting tracer specific options Documentation: Update ftrace-design.txt Documentation: Update tracepoint-analysis.txt Documentation: Update mmiotrace.txt	2009-12-31 11:52:01 -08:00
KOSAKI Motohiro	0f4bd46ec2	kmsg_dump: Dump on crash_kexec as well crash_kexec gets called before kmsg_dump(KMSG_DUMP_OOPS) if panic_on_oops is set, so the kernel log buffer is not stored for this case. This patch adds a KMSG_DUMP_KEXEC dump type which gets called when crash_kexec() is invoked. To avoid getting double dumps, the old KMSG_DUMP_PANIC is moved below crash_kexec(). The mtdoops driver is modified to handle KMSG_DUMP_KEXEC in the same way as a panic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Simon Kagstrom <simon.kagstrom@netinsight.net> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-12-31 19:45:04 +00:00
Peter Zijlstra	05cbaa2853	perf: Fix NULL deref in inheritance code Liming found a NULL deref when a task has a perf context but no counters when it forks. This can occur in two cases, a race during construction where the fork hits after installing the context but before the first counter gets inserted, or more reproducably, a fork after the last counter is closed (which leaves the context around). Reported-by: Wang Liming <liming.wang@windriver.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> CC: <stable@kernel.org> LKML-Reference: <1262185684.7135.222.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-31 13:11:31 +01:00
Lai Jiangshan	fb7ae981cb	tracing: Fix sign fields in ftrace_define_fields_##call() Add is_signed_type() call to trace_define_field() in ftrace macros. The code previously just passed in 0 (false), disregarding whether or not the field was actually a signed type. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D3A.6020007@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-30 10:27:06 -05:00
Lai Jiangshan	79b4082108	tracing/kprobe: Show sign of fields in trace_kprobe format files The format files of trace_kprobe do not show the sign of the fields. The other format files show the field signed type of the fields and this patch makes the trace_kprobe formats consistent with the others. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D27.5040009@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-30 10:27:03 -05:00
Li Zefan	53ab668064	ksym_tracer: Remove trace_stat trace_stat is problematic. Don't use it, use seqfile instead. This fixes a race that reading the stat file is not protected by any lock, which can lead to use after free. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B3AF203.40200@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-30 07:50:50 +01:00
Li Zefan	e6d9491bf8	ksym_tracer: Fix race when incrementing count We are under rcu read section but not holding the write lock, so count++ is not atomic. Use atomic64_t instead. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B3AF1EC.9010608@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-30 07:50:49 +01:00
Li Zefan	3d13ec2efd	ksym_tracer: Fix to allow writing newline to ksym_trace_filter It used to work, but now doesn't: # echo > ksym_filter bash: echo: write error: Invalid argument It's caused by `d954fbf0ff` ("tracing: Fix wrong usage of strstrip in trace_ksyms"). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B3AF1D7.5040400@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-30 07:50:49 +01:00
Li Zefan	88f7a890d7	ksym_tracer: Fix to make the tracer work ksym tracer doesn't work: # echo tasklist_lock:rw- > ksym_trace_filter -bash: echo: write error: No such device It's because we pass to perf_event_create_kernel_counter() a cpu number which is not present. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B3AF19E.1010201@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-30 07:50:47 +01:00
Simon Kagstrom	d894837f23	sched: might_sleep(): Make file parameter const char * Fixes a warning when building with g++: warning: deprecated conversion from string constant to 'char*' And the file parameter use is constant, so mark it as such. Signed-off-by: Simon Kagstrom <simon.kagstrom@netinsight.net> Cc: peterz@infradead.org LKML-Reference: <20091223110818.442d848e@marrow.netinsight.se> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-28 10:50:13 +01:00
Randy Dunlap	40892367bc	tracing: Kconfig spelling fixes and cleanups Fix filename reference (ftrace-implementation.txt -> ftrace-design.txt). Fix spelling, punctuation, grammar. Fix help text indentation and line lengths to reduce need for horizontal scrolling or larger window sizes. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091221120117.3fb49cdc.randy.dunlap@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-28 10:37:54 +01:00
Li Zefan	07b139c8c8	perf events: Remove CONFIG_EVENT_PROFILE Quoted from Ingo: \| This reminds me - i think we should eliminate CONFIG_EVENT_PROFILE - \| it's an unnecessary Kconfig complication. If both PERF_EVENTS and \| EVENT_TRACING is enabled we should expose generic tracepoints. \| \| Nor is it limited to event 'profiling', so it has become a misnomer as \| well. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B2F1557.2050705@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-28 10:33:06 +01:00
Heiko Carstens	c2ef6661ce	kprobes: Fix distinct type warning Every time I see this: kernel/kprobes.c: In function 'register_kretprobe': kernel/kprobes.c:1038: warning: comparison of distinct pointer types lacks a cast I'm wondering if something changed in common code and we need to do something for s390. Apparently that's not the case. Let's get rid of this annoying warning. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> LKML-Reference: <20091221120224.GA4471@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-28 10:25:31 +01:00
Peter Zijlstra	49f474331e	perf events: Remove arg from perf sched hooks Since we only ever schedule the local cpu, there is no need to pass the cpu number to the perf sched hooks. This micro-optimizes things a bit. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-28 09:21:33 +01:00
Linus Torvalds	0b5e2588d8	Merge branch 'sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc-2.6 * 'sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc-2.6: SYSCTL: Add a mutex to the page_alloc zone order sysctl SYSCTL: Print binary sysctl warnings (nearly) only once	2009-12-24 13:01:29 -08:00
Andi Kleen	4440095c82	SYSCTL: Print binary sysctl warnings (nearly) only once When printing legacy sysctls print the warning message for each of them only once. This way there is a guarantee the syslog won't be flooded for any sane program. The original attempt at this made the tables non const and stored the flag inline. Linus suggested using a separate hash table for this, this is based on a code snippet from him. The hash implies this is not exact and can sometimes not print a new sysctl due to a hash collision, but in practice this should not be a problem I used a FNV32 hash over the binary string with a 32byte bitmap. This gives relatively little collisions when all the predefined binary sysctls are hashed: size 256 bucket length number 0: [25] 1: [67] 2: [88] 3: [47] 4: [22] 5: [6] 6: [1] The worst case is a single collision of 6 hash values. Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-12-23 21:00:20 +01:00
Linus Torvalds	6432ed648a	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Revert `738d2be`, simplify set_task_cpu()	2009-12-23 09:12:57 -08:00
Peter Zijlstra	0c69774e6c	sched: Revert `738d2be`, simplify set_task_cpu() Effectively reverts `738d2be430`. As demonstrated by Eric, we really need to call __set_task_cpu() early in the fork() path to properly initialize the various task state -- specifically the cgroup state through set_task_rq(). [ we could probably fix this by explicitly calling __set_task_cpu() from sched_fork(), but lets try that for the next cycle and simply revert to the old behaviour for now. ] Reported-by: Eric Paris <eparis@redhat.com> Tested-by: Eric Paris <eparis@redhat.com>, Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: efault@gmx.de LKML-Reference: <1261492999.4937.36.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-23 10:04:10 +01:00
Linus Torvalds	fe35d4a028	Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: jfs: Fix 32bit build warning Remove obsolete comment in fs.h Sanitize f_flags helpers Fix f_flags/f_mode in case of lookup_instantiate_filp() from open(pathname, 3) anonfd: Allow making anon files read-only fs/compat_ioctl.c: fix build error when !BLOCK pohmelfs needs I_LOCK alloc_file(): simplify handling of mnt_clone_write() errors	2009-12-22 14:20:48 -08:00
Stefani Seibold	86d4880313	kfifo: add record handling functions Add kfifo_in_rec() - puts some record data into the FIFO Add kfifo_out_rec() - gets some record data from the FIFO Add kfifo_from_user_rec() - puts some data from user space into the FIFO Add kfifo_to_user_rec() - gets data from the FIFO and write it to user space Add kfifo_peek_rec() - gets the size of the next FIFO record field Add kfifo_skip_rec() - skip the next fifo out record Add kfifo_avail_rec() - determinate the number of bytes available in a record FIFO Signed-off-by: Stefani Seibold <stefani@seibold.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:17:56 -08:00
Stefani Seibold	a121f24acc	kfifo: add kfifo_skip, kfifo_from_user and kfifo_to_user Add kfifo_reset_out() for save lockless discard the fifo output Add kfifo_skip() to skip a number of output bytes Add kfifo_from_user() to copy user space data into the fifo Add kfifo_to_user() to copy fifo data to user space Signed-off-by: Stefani Seibold <stefani@seibold.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:17:56 -08:00
Stefani Seibold	7acd72eb85	kfifo: rename kfifo_put... into kfifo_in... and kfifo_get... into kfifo_out... rename kfifo_put... into kfifo_in... to prevent miss use of old non in kernel-tree drivers ditto for kfifo_get... -> kfifo_out... Improve the prototypes of kfifo_in and kfifo_out to make the kerneldoc annotations more readable. Add mini "howto porting to the new API" in kfifo.h Signed-off-by: Stefani Seibold <stefani@seibold.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:17:56 -08:00
Stefani Seibold	e64c026dd0	kfifo: cleanup namespace change name of __kfifo_* functions to kfifo_*, because the prefix __kfifo should be reserved for internal functions only. Signed-off-by: Stefani Seibold <stefani@seibold.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:17:56 -08:00
Stefani Seibold	c1e13f2567	kfifo: move out spinlock Move the pointer to the spinlock out of struct kfifo. Most users in tree do not actually use a spinlock, so the few exceptions now have to call kfifo_{get,put}_locked, which takes an extra argument to a spinlock. Signed-off-by: Stefani Seibold <stefani@seibold.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:17:56 -08:00
Stefani Seibold	4546548789	kfifo: move struct kfifo in place This is a new generic kernel FIFO implementation. The current kernel fifo API is not very widely used, because it has to many constrains. Only 17 files in the current 2.6.31-rc5 used it. FIFO's are like list's a very basic thing and a kfifo API which handles the most use case would save a lot of development time and memory resources. I think this are the reasons why kfifo is not in use: - The API is to simple, important functions are missing - A fifo can be only allocated dynamically - There is a requirement of a spinlock whether you need it or not - There is no support for data records inside a fifo So I decided to extend the kfifo in a more generic way without blowing up the API to much. The new API has the following benefits: - Generic usage: For kernel internal use and/or device driver. - Provide an API for the most use case. - Slim API: The whole API provides 25 functions. - Linux style habit. - DECLARE_KFIFO, DEFINE_KFIFO and INIT_KFIFO Macros - Direct copy_to_user from the fifo and copy_from_user into the fifo. - The kfifo itself is an in place member of the using data structure, this save an indirection access and does not waste the kernel allocator. - Lockless access: if only one reader and one writer is active on the fifo, which is the common use case, no additional locking is necessary. - Remove spinlock - give the user the freedom of choice what kind of locking to use if one is required. - Ability to handle records. Three type of records are supported: - Variable length records between 0-255 bytes, with a record size field of 1 bytes. - Variable length records between 0-65535 bytes, with a record size field of 2 bytes. - Fixed size records, which no record size field. - Preserve memory resource. - Performance! - Easy to use! This patch: Since most users want to have the kfifo as part of another object, reorganize the code to allow including struct kfifo in another data structure. This requires changing the kfifo_alloc and kfifo_init prototypes so that we pass an existing kfifo pointer into them. This patch changes the implementation and all existing users. [akpm@linux-foundation.org: fix warning] Signed-off-by: Stefani Seibold <stefani@seibold.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:17:55 -08:00
Linus Torvalds	83f57a11d8	Revert "time: Remove xtime_cache" This reverts commit `7bc7d63745`, as requested by John Stultz. Quoting John: "Petr Titěra reported an issue where he saw odd atime regressions with 2.6.33 where there were a full second worth of nanoseconds in the nanoseconds field. He also reviewed the time code and narrowed down the problem: unhandled overflow of the nanosecond field caused by rounding up the sub-nanosecond accumulated time. Details: * At the end of update_wall_time(), we currently round up the sub-nanosecond portion of accumulated time when storing it into xtime. This was added to avoid time inconsistencies caused when the sub-nanosecond portion was truncated when storing into xtime. Unfortunately we don't handle the possible second overflow caused by that rounding. * Previously the xtime_cache code hid this overflow by normalizing the xtime value when storing into the xtime_cache. * We could try to handle the second overflow after the rounding up, but since this affects the timekeeping's internal state, this would further complicate the next accumulation cycle, causing small errors in ntp steering. As much as I'd like to get rid of it, the xtime_cache code is known to work. * The correct fix is really to include the sub-nanosecond portion in the timekeeping accessor function, so we don't need to round up at during accumulation. This would greatly simplify the accumulation code. Unfortunately, we can't do this safely until the last three non-GENERIC_TIME arches (sparc32, arm, cris) are converted (those patches are in -mm) and we kill off the spots where arches set xtime directly. This is all 2.6.34 material, so I think reverting the xtime_cache change is the best approach for now. Many thanks to Petr for both reporting and finding the issue!" Reported-by: Petr Titěra <P.Titera@century.cz> Requested-by: john stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:10:37 -08:00
Al Viro	5300990c03	Sanitize f_flags helpers * pull ACC_MODE to fs.h; we have several copies all over the place * nightmarish expression calculating f_mode by f_flags deserves a helper too (OPEN_FMODE(flags)) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-12-22 12:27:34 -05:00
Roland Dreier	628ff7c1d8	anonfd: Allow making anon files read-only It seems a couple places such as arch/ia64/kernel/perfmon.c and drivers/infiniband/core/uverbs_main.c could use anon_inode_getfile() instead of a private pseudo-fs + alloc_file(), if only there were a way to get a read-only file. So provide this by having anon_inode_getfile() create a read-only file if we pass O_RDONLY in flags. Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-12-22 12:27:34 -05:00
Steven Rostedt	c757bea93b	tracing: Fix setting tracer specific options The function __set_tracer_option() takes as its last parameter a "neg" value. If set it should negate the value of the option. The trace_options_write() passed the value written to the file which is what the new value needs to be set as. But since this is not the negative, it never sets the value. Reported-by: Peter Zijlstra <peterz@infradead.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-21 22:35:16 -05:00
Dominik Brodowski	0e2c8b8f55	resources: fix call to alignf() in allocate_resource() The second parameter to alignf() in allocate_resource() must reflect what new resource is attempted to be allocated, else functions like pcibios_align_resource() (at least on x86) or pcmcia_align() can't work correctly. Commit `1e5ad96790` broke this by setting the "new" resource until we're about to return success. To keep the resource untouched when allocate_resource() fails, a "tmp" resource is introduced. Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Acked-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-21 10:42:29 -08:00
Peter Zijlstra	70f1120527	sched: Fix hotplug hang The hot-unplug kstopmachine usage does a wakeup after deactivating the cpu, hence we cannot use cpu_active() here but must rely on the good olde online. Reported-by: Sachin Sant <sachinp@in.ibm.com> Reported-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Jens Axboe <jens.axboe@oracle.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> LKML-Reference: <1261326987.4314.24.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-20 23:31:23 +01:00
Peter Zijlstra	3df0fc5b2e	sched: Restore printk sanity Revert the braindead pr_* crap. (Commit `663997d` "sched: Use pr_fmt() and pr_<level>()") It's dumb and causes stupid "sched: " strings all over the place. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Mike Galbraith <efault@gmx.de> Cc: Joe Perches <joe@perches.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <1261315437.4314.6.camel@laptop> [ i dont mind the pr_*() patterns that much - but Peter dislikes them with a vengence. ] [ - v2: remove spurious diffstat from changelog :-/ ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-20 19:05:02 +01:00
Linus Torvalds	eca9dfcd00	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf session: Make events_stats u64 to avoid overflow on 32-bit arches hw-breakpoints: Fix hardware breakpoints -> perf events dependency perf events: Dont report side-band events on each cpu for per-task-per-cpu events perf events, x86/stacktrace: Fix performance/softlockup by providing a special frame pointer-only stack walker perf events, x86/stacktrace: Make stack walking optional perf events: Remove unused perf_counter.h header file perf probe: Check new event name kprobe-tracer: Check new event/group name perf probe: Check whether debugfs path is correct perf probe: Fix libdwarf include path for Debian	2009-12-19 09:48:42 -08:00
Linus Torvalds	aac3d39693	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits) sched: Fix broken assertion sched: Assert task state bits at build time sched: Update task_state_arraypwith new states sched: Add missing state chars to TASK_STATE_TO_CHAR_STR sched: Move TASK_STATE_TO_CHAR_STR near the TASK_state bits sched: Teach might_sleep() about preemptible RCU sched: Make warning less noisy sched: Simplify set_task_cpu() sched: Remove the cfs_rq dependency from set_task_cpu() sched: Add pre and post wakeup hooks sched: Move kthread_bind() back to kthread.c sched: Fix select_task_rq() vs hotplug issues sched: Fix sched_exec() balancing sched: Ensure set_task_cpu() is never called on blocked tasks sched: Use TASK_WAKING for fork wakups sched: Select_task_rq_fair() must honour SD_LOAD_BALANCE sched: Fix task_hot() test order sched: Fix set_cpu_active() in cpu_down() sched: Mark boot-cpu active before smp_init() sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK ...	2009-12-19 09:47:49 -08:00
Linus Torvalds	10e5453ffa	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sys: Fix missing rcu protection for __task_cred() access signals: Fix more rcu assumptions signal: Fix racy access to __task_cred in kill_pid_info_as_uid()	2009-12-19 09:47:34 -08:00
Linus Torvalds	3cd312c3e8	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timers: Remove duplicate setting of new_base in __mod_timer() clockevents: Prevent clockevent_devices list corruption on cpu hotplug	2009-12-19 09:47:18 -08:00
Al Viro	b4c30aad39	fix more leaks in audit_tree.c tag_chunk() Several leaks in audit_tree didn't get caught by commit `318b6d3d7d`, including the leak on normal exit in case of multiple rules refering to the same chunk. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-19 09:27:43 -08:00
Al Viro	6f5d511489	fix braindamage in audit_tree.c untag_chunk() ... aka "Al had badly fscked up when writing that thing and nobody noticed until Eric had fixed leaks that used to mask the breakage". The function essentially creates a copy of old array sans one element and replaces the references to elements of original (they are on cyclic lists) with those to corresponding elements of new one. After that the old one is fair game for freeing. First of all, there's a dumb braino: when we get to list_replace_init we use indices for wrong arrays - position in new one with the old array and vice versa. Another bug is more subtle - termination condition is wrong if the element to be excluded happens to be the last one. We shouldn't go until we fill the new array, we should go until we'd finished the old one. Otherwise the element we are trying to kill will remain on the cyclic lists... That crap used to be masked by several leaks, so it was not quite trivial to hit. Eric had fixed some of those leaks a while ago and the shit had hit the fan... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-19 09:27:43 -08:00
Linus Torvalds	55db493b65	Merge branch 'cpumask-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * 'cpumask-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: cpumask: rename tsk_cpumask to tsk_cpus_allowed cpumask: don't recommend set_cpus_allowed hack in Documentation/cpu-hotplug.txt cpumask: avoid dereferencing struct cpumask cpumask: convert drivers/idle/i7300_idle.c to cpumask_var_t cpumask: use modern cpumask style in drivers/scsi/fcoe/fcoe.c cpumask: avoid deprecated function in mm/slab.c cpumask: use cpu_online in kernel/perf_event.c	2009-12-17 17:00:20 -08:00
Linus Torvalds	efc8e7f4c8	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: Keys: KEYCTL_SESSION_TO_PARENT needs TIF_NOTIFY_RESUME architecture support NOMMU: Optimise away the {dac_,}mmap_min_addr tests security/min_addr.c: make init_mmap_min_addr() static keys: PTR_ERR return of wrong pointer in keyctl_get_security()	2009-12-17 16:58:26 -08:00
Linus Torvalds	dcc7cd0112	Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6 * 'kmemleak' of git://linux-arm.org/linux-2.6: kmemleak: fix kconfig for crc32 build error kmemleak: Reduce the false positives by checking for modified objects kmemleak: Show the age of an unreferenced object kmemleak: Release the object lock before calling put_object() kmemleak: Scan the _ftrace_events section in modules kmemleak: Simplify the kmemleak_scan_area() function prototype kmemleak: Do not use off-slab management with SLAB_NOLEAKTRACE	2009-12-17 16:00:19 -08:00
Randy Dunlap	6485536bcf	printk: fix new kernel-doc warnings Fix kernel-doc warnings in printk.c: Warning(kernel/printk.c:1422): No description found for parameter 'dumper' Warning(kernel/printk.c:1422): Excess function parameter 'dump' description in 'kmsg_dump_register' Warning(kernel/printk.c:1451): No description found for parameter 'dumper' Warning(kernel/printk.c:1451): Excess function parameter 'dump' description in 'kmsg_dump_unregister' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-17 15:45:32 -08:00
Oleg Nesterov	9cd80bbb07	do_wait() optimization: do not place sub-threads on task_struct->children list Thanks to Roland who pointed out de_thread() issues. Currently we add sub-threads to ->real_parent->children list. This buys nothing but slows down do_wait(). With this patch ->children contains only main threads (group leaders). The only complication is that forget_original_parent() should iterate over sub-threads by hand, and de_thread() needs another list_replace() when it changes ->group_leader. Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD tasks, we can remove this check (and we can unify do_wait_thread() and ptrace_do_wait()). This change can confuse the optimistic search in mm_update_next_owner(), but this is fixable and minor. Perhaps badness() and oom_kill_process() should be updated, but they should be fixed in any case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-17 15:45:31 -08:00
WANG Cong	3e26120cc7	kernel/sysctl.c: fix the incomplete part of sysctl_max_map_count-should-be-non-negative.patch It is a mistake that we used 'proc_dointvec', it should be 'proc_dointvec_minmax', as in the original patch. Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-17 15:45:30 -08:00
Linus Torvalds	5a865c0606	Merge branch 'for-33' of git://repo.or.cz/linux-kbuild * 'for-33' of git://repo.or.cz/linux-kbuild: (29 commits) net: fix for utsrelease.h moving to generated gen_init_cpio: fixed fwrite warning kbuild: fix make clean after mismerge kbuild: generate modules.builtin genksyms: properly consider EXPORT_UNUSED_SYMBOL{,_GPL}() score: add asm/asm-offsets.h wrapper unifdef: update to upstream revision 1.190 kbuild: specify absolute paths for cscope kbuild: create include/generated in silentoldconfig scripts/package: deb-pkg: use fakeroot if available scripts/package: add KBUILD_PKG_ROOTCMD variable scripts/package: tar-pkg: use tar --owner=root Kbuild: clean up marker net: add net_tstamp.h to headers_install kbuild: move utsrelease.h to include/generated kbuild: move autoconf.h to include/generated drop explicit include of autoconf.h kbuild: move compile.h to include/generated kbuild: drop include/asm kbuild: do not check for include/asm-$ARCH ... Fixed non-conflicting clean merge of modpost.c as per comments from Stephen Rothwell (modpost.c had grown an include of linux/autoconf.h that needed to be changed to generated/autoconf.h)	2009-12-17 07:23:42 -08:00
Peter Zijlstra	077614ee1e	sched: Fix broken assertion There's a preemption race in the set_task_cpu() debug check in that when we get preempted after setting task->state we'd still be on the rq proper, but fail the test. Check for preempted tasks, since those are always on the RQ. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20091217121830.137155561@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-17 13:22:46 +01:00
Peter Zijlstra	5d27c23df0	perf events: Dont report side-band events on each cpu for per-task-per-cpu events Acme noticed that his FORK/MMAP numbers were inflated by about the same factor as his cpu-count. This led to the discovery of a few more sites that need to respect the event->cpu filter. Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091217121830.215333434@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-17 13:21:36 +01:00
Frederic Weisbecker	61c1917f47	perf events, x86/stacktrace: Make stack walking optional The current print_context_stack helper that does the stack walking job is good for usual stacktraces as it walks through all the stack and reports even addresses that look unreliable, which is nice when we don't have frame pointers for example. But we have users like perf that only require reliable stacktraces, and those may want a more adapted stack walker, so lets make this function a callback in stacktrace_ops that users can tune for their needs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1261024834-5336-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-17 09:56:19 +01:00
Frederic Weisbecker	234da7bcdc	sched: Teach might_sleep() about preemptible RCU In practice, it is harmless to voluntarily sleep in a rcu_read_lock() section if we are running under preempt rcu, but it is illegal if we build a kernel running non-preemptable rcu. Currently, might_sleep() doesn't notice sleepable operations under rcu_read_lock() sections if we are running under preemptable rcu because preempt_count() is left untouched after rcu_read_lock() in this case. But we want developers who test their changes under such config to notice the "sleeping while atomic" issues. So we add rcu_read_lock_nesting to prempt_count() in might_sleep() checks. [ v2: Handle rcu-tiny ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1260991265-8451-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-17 09:46:44 +01:00
Masami Hiramatsu	6f3cf44047	kprobe-tracer: Check new event/group name Check new event/group name is same syntax as a C symbol. In other words, checking the name is as like as other tracepoint events. This can prevent user to create an event with useless name (e.g. foo\|bar, foo*bar). Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> LKML-Reference: <20091216222408.14459.68790.stgit@dhcp-100-2-132.bos.redhat.com> [ v2: minor cleanups ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-17 09:42:44 +01:00
Ingo Molnar	416eb39556	sched: Make warning less noisy Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.807938893@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-17 06:05:49 +01:00
Rusty Russell	62ac127950	cpumask: avoid dereferencing struct cpumask struct cpumask will be undefined soon with CONFIG_CPUMASK_OFFSTACK=y, to avoid them being declared on the stack. cpumask_bits() does what we want here (of course, this code is crap). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> To: Thomas Gleixner <tglx@linutronix.de>	2009-12-17 11:43:29 +10:30
Rusty Russell	f6325e30eb	cpumask: use cpu_online in kernel/perf_event.c Also, we want to check against nr_cpu_ids, not num_possible_cpus(). The latter works, but the correct bounds check is < nr_cpu_ids. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> To: Thomas Gleixner <tglx@linutronix.de>	2009-12-17 11:43:11 +10:30
Simon Horman	cf1e367ee8	timers: Remove duplicate setting of new_base in __mod_timer() new_base is set using per_cpu(tvec_bases, cpu) after selecting the desired value of cpu immediately below so this line is a unnecessary. Signed-off-by: Simon Horman <horms@verge.net.au> LKML-Reference: <20091217001542.GD25317@verge.net.au> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-12-17 01:30:49 +01:00
David Howells	6e14154676	NOMMU: Optimise away the {dac_,}mmap_min_addr tests In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be skipped by the compiler. We do this as the minimum mmap address doesn't make any sense in NOMMU mode. mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-12-17 09:25:19 +11:00
Andi Kleen	61cf693159	[sysctl] Fix breakage on systems with older glibc As predicted during code review, the sysctl(2) changes made systems with old glibc nearly unusable. About every command gives a: warning: process `ls' used the deprecated sysctl system call with 1.4 warning in the log. I see this on a SUSE 10.0 system with glibc 2.3.5. Don't warn for this common case. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 12:36:18 -08:00
Linus Torvalds	8aedf8a6ae	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (52 commits) perf record: Use per-task-per-cpu events for inherited events perf record: Properly synchronize child creation perf events: Allow per-task-per-cpu counters perf diff: Percent calcs should use double values perf diff: Change the default sort order to "dso,symbol" perf diff: Use perf_session__fprintf_hists just like 'perf record' perf report: Fix cut'n'paste error recently introduced perf session: Move perf report specific hits out of perf_session__fprintf_hists perf tools: Move hist entries printing routines from perf report perf report: Generalize perf_session__fprintf_hists() perf symbols: Move symbol filtering to event__preprocess_sample() perf symbols: Adopt the strlists for dso, comm perf symbols: Make symbol_conf global perf probe: Fix to show which probe point is not found perf probe: Check symbols in symtab/kallsyms perf probe: Check build-id of vmlinux perf probe: Reject second attempt of adding same-name event perf probe: Support event name for --add option perf probe: Add glob matching support on --del perf probe: Use strlist__for_each macros in probe-event.c ...	2009-12-16 12:32:47 -08:00
Linus Torvalds	da184a8064	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix return of trace_dump_stack() ksym_tracer: Fix bad cast tracing/power: Remove two exports tracing: Change event->profile_count to be int type tracing: Simplify trace_option_write() tracing: Remove useless trace option tracing: Use seq file for trace_clock tracing: Use seq file for trace_options function-graph: Allow writing the same val to set_graph_function ftrace: Call trace_parser_clear() properly ftrace: Return EINVAL when writing invalid val to set_ftrace_filter tracing: Move a printk out of ftrace_raw_reg_event_foo() tracing: Pull up calls to trace_define_common_fields() tracing: Extract duplicate ftrace_raw_init_event_foo() ftrace.h: Use common pr_info fmt string tracing: Add stack trace to irqsoff tracer tracing: Add trace_dump_stack() ring-buffer: Move resize integrity check under reader lock ring-buffer: Use sync sched protection on ring buffer resizing tracing: Fix wrong usage of strstrip in trace_ksyms	2009-12-16 12:02:25 -08:00
Linus Torvalds	74f3ae7434	Merge branch 'module' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * 'module' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: modpost: fix segfault with short symbol names module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y Kbuild: clear marker out of modpost module: make MODULE_SYMBOL_PREFIX into a CONFIG option ARM: unexport symbols used to implement floating point emulation ARM: use unified discard definition in linker script x86: don't export inline function sparc64: don't export static inline pci_ functions	2009-12-16 10:47:24 -08:00
Linus Torvalds	60d9aa758c	Merge git://git.infradead.org/mtd-2.6 * git://git.infradead.org/mtd-2.6: (90 commits) jffs2: Fix long-standing bug with symlink garbage collection. mtd: OneNAND: Fix test of unsigned in onenand_otp_walk() mtd: cfi_cmdset_0002, fix lock imbalance Revert "mtd: move mxcnd_remove to .exit.text" mtd: m25p80: add support for Macronix MX25L4005A kmsg_dump: fix build for CONFIG_PRINTK=n mtd: nandsim: add support for 4KiB pages mtd: mtdoops: refactor as a kmsg_dumper mtd: mtdoops: make record size configurable mtd: mtdoops: limit the maximum mtd partition size mtd: mtdoops: keep track of used/unused pages in an array mtd: mtdoops: several minor cleanups core: Add kernel message dumper to call on oopses and panics mtd: add ARM pismo support mtd: pxa3xx_nand: Fix PIO data transfer mtd: nand: fix multi-chip suspend problem mtd: add support for switching old SST chips into QRY mode mtd: fix M29W800D dev_id and uaddr mtd: don't use PF_MEMALLOC mtd: Add bad block table overrides to Davinci NAND driver ... Fixed up conflicts (mostly trivial) in drivers/mtd/devices/m25p80.c drivers/mtd/maps/pcmciamtd.c drivers/mtd/nand/pxa3xx_nand.c kernel/printk.c	2009-12-16 10:23:43 -08:00
Peter Zijlstra	738d2be430	sched: Simplify set_task_cpu() Rearrange code a bit now that its a simpler function. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170518.269101883@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:59 +01:00
Peter Zijlstra	88ec22d3ed	sched: Remove the cfs_rq dependency from set_task_cpu() In order to remove the cfs_rq dependency from set_task_cpu() we need to ensure the task is cfs_rq invariant for all callsites. The simple approach is to substract cfs_rq->min_vruntime from se->vruntime on dequeue, and add cfs_rq->min_vruntime on enqueue. However, this has the downside of breaking FAIR_SLEEPERS since we loose the old vruntime as we only maintain the relative position. To solve this, we observe that we only migrate runnable tasks, we do this using deactivate_task(.sleep=0) and activate_task(.wakeup=0), therefore we can restrain the min_vruntime invariance to that state. The only other case is wakeup balancing, since we want to maintain the old vruntime we cannot make it relative on dequeue, but since we don't migrate inactive tasks, we can do so right before we activate it again. This is where we need the new pre-wakeup hook, we need to call this while still holding the old rq->lock. We could fold it into ->select_task_rq(), but since that has multiple callsites and would obfuscate the locking requirements, that seems like a fudge. This leaves the fork() case, simply make sure that ->task_fork() leaves the ->vruntime in a relative state. This covers all cases where set_task_cpu() gets called, and ensures it sees a relative vruntime. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170518.191697025@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:58 +01:00
Peter Zijlstra	efbbd05a59	sched: Add pre and post wakeup hooks As will be apparent in the next patch, we need a pre wakeup hook for sched_fair task migration, hence rename the post wakeup hook and one pre wakeup. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170518.114746117@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:58 +01:00
Peter Zijlstra	881232b70b	sched: Move kthread_bind() back to kthread.c Since kthread_bind() lost its dependencies on sched.c, move it back where it came from. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170518.039524041@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:57 +01:00
Peter Zijlstra	5da9a0fb67	sched: Fix select_task_rq() vs hotplug issues Since select_task_rq() is now responsible for guaranteeing ->cpus_allowed and cpu_active_mask, we need to verify this. select_task_rq_rt() can blindly return smp_processor_id()/task_cpu() without checking the valid masks, select_task_rq_fair() can do the same in the rare case that all SD_flags are disabled. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.961475466@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:57 +01:00
Peter Zijlstra	3802290628	sched: Fix sched_exec() balancing Since we access ->cpus_allowed without holding rq->lock we need a retry loop to validate the result, this comes for near free when we merge sched_migrate_task() into sched_exec() since that already does the needed check. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.884743662@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:56 +01:00
Peter Zijlstra	e2912009fb	sched: Ensure set_task_cpu() is never called on blocked tasks In order to clean up the set_task_cpu() rq dependencies we need to ensure it is never called on blocked tasks because such usage does not pair with consistent rq->lock usage. This puts the migration burden on ttwu(). Furthermore we need to close a race against changing ->cpus_allowed, since select_task_rq() runs with only preemption disabled. For sched_fork() this is safe because the child isn't in the tasklist yet, for wakeup we fix this by synchronizing set_cpus_allowed_ptr() against TASK_WAKING, which leaves sched_exec to be a problem This also closes a hole in (`6ad4c1888` sched: Fix balance vs hotplug race) where ->select_task_rq() doesn't validate the result against the sched_domain/root_domain. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.807938893@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:56 +01:00
Peter Zijlstra	06b83b5fbe	sched: Use TASK_WAKING for fork wakups For later convenience use TASK_WAKING for fresh tasks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.732561278@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:55 +01:00
Peter Zijlstra	e4f4288842	sched: Select_task_rq_fair() must honour SD_LOAD_BALANCE We should skip !SD_LOAD_BALANCE domains. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.653578430@chello.nl> CC: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:55 +01:00
Peter Zijlstra	e6c8fba777	sched: Fix task_hot() test order Make sure not to access sched_fair fields before verifying it is indeed a sched_fair task. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> CC: stable@kernel.org LKML-Reference: <20091216170517.577998058@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:54 +01:00
Xiaotian Feng	9ee349ad6d	sched: Fix set_cpu_active() in cpu_down() Sachin found cpu hotplug test failures on powerpc, which made the kernel hang on his POWER box. The problem is that we fail to re-activate a cpu when a hot-unplug fails. Fix this by moving the de-activation into _cpu_down after doing the initial checks. Remove the synchronize_sched() calls and rely on those implied by rebuilding the sched domains using the new mask. Reported-by: Sachin Sant <sachinp@in.ibm.com> Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Tested-by: Sachin Sant <sachinp@in.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.500272612@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:53 +01:00
Ingo Molnar	ee1156c11a	Merge branch 'linus' into sched/urgent Conflicts: kernel/sched_idletask.c Merge reason: resolve the conflicts, pick up latest changes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 18:33:49 +01:00
Peter Zijlstra	f4c4176f21	perf events: Allow per-task-per-cpu counters In order to allow for per-task-per-cpu counters, useful for scalability when profiling task hierarchies, we allow installing events with event->cpu != -1 in task contexts. __perf_event_sched_in() already skips events where ->cpu mis-matches the current cpu, fix up __perf_install_in_context() and __perf_event_enable() to also respect this filter. This does lead to vary hard to interpret enabled/running times for such counters, but I don't see a simple solution for that. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: fweisbec@gmail.com Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091216165904.831451147@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 18:30:11 +01:00
Amerigo Wang	06a7f71124	kexec: premit reduction of the reserved memory size Implement shrinking the reserved memory for crash kernel, if it is more than enough. For example, if you have already reserved 128M, now you just want 100M, you can do: # echo $((10010241024)) > /sys/kernel/kexec_crash_size Note, you can only do this before loading the crash kernel. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Neil Horman <nhorman@redhat.com> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:13 -08:00
André Goddard Rosa	417e315247	pid: reduce code size by using a pointer to iterate over array It decreases code size by 16 bytes on my gcc 4.4.1 on Core 2: text data bss dec hex filename 4314 2216 8 6538 198a kernel/pid.o-BEFORE 4298 2216 8 6522 197a kernel/pid.o-AFTER Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:12 -08:00
André Goddard Rosa	7be6d991bc	pid: tighten pidmap spinlock critical section by removing kfree() Avoid calling kfree() under pidmap spinlock, calling it afterwards. Normally kfree() is fast, but sometimes it can be slow, so avoid calling it under the spinlock if we can do it. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:12 -08:00
Oleg Nesterov	1be53963b0	signals: check ->group_stop_count after tracehook_get_signal() Move the call to do_signal_stop() down, after tracehook call. This makes ->group_stop_count condition visible to tracers before do_signal_stop() will participate in this group-stop. Currently the patch has no effect, tracehook_get_signal() always returns 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:09 -08:00
Oleg Nesterov	ad09750b51	signals: kill force_sig_specific() Kill force_sig_specific(), this trivial wrapper has no callers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:09 -08:00
Oleg Nesterov	7486e5d9fc	signals: cosmetic, collect_signal: use SI_USER Trivial, s/0/SI_USER/ in collect_signal() for grep. This is a bit confusing, we don't know the source of this signal. But we don't care, and "info->si_code = 0" is imho worse. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:09 -08:00
Oleg Nesterov	dd34200adc	signals: send_signal: use si_fromuser() to detect from_ancestor_ns Change send_signal() to use si_fromuser(). From now SEND_SIG_NOINFO triggers the "from_ancestor_ns" check. This fixes reparent_thread()->group_send_sig_info(pdeath_signal) behaviour, before this patch send_signal() does not detect the cross-namespace case when the child of the dying parent belongs to the sub-namespace. This patch can affect the behaviour of send_sig(), kill_pgrp() and kill_pid() when the caller sends the signal to the sub-namespace with "priv == 0" but surprisingly all callers seem to use them correctly, including disassociate_ctty(on_exit). Except: drivers/staging/comedi/drivers/addi-data/*.c incorrectly use send_sig(priv => 0). But his is minor and should be fixed anyway. Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:09 -08:00
Oleg Nesterov	614c517d7c	signals: SEND_SIG_NOINFO should be considered as SI_FROMUSER() No changes in compiled code. The patch adds the new helper, si_fromuser() and changes check_kill_permission() to use this helper. The real effect of this patch is that from now we "officially" consider SEND_SIG_NOINFO signal as "from user-space" signals. This is already true if we look at the code which uses SEND_SIG_NOINFO, except __send_signal() has another opinion - see the next patch. The naming of these special SEND_SIG_XXX siginfo's is really bad imho. From __send_signal()'s pov they mean SEND_SIG_NOINFO from user SEND_SIG_PRIV from kernel SEND_SIG_FORCED no info Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:08 -08:00
Oleg Nesterov	6580807da1	ptrace: copy_process() should disable stepping If the tracee calls fork() after PTRACE_SINGLESTEP, the forked child starts with TIF_SINGLESTEP/X86_EFLAGS_TF bits copied from ptraced parent. This is not right, especially when the new child is not auto-attaced: in this case it is killed by SIGTRAP. Change copy_process() to call user_disable_single_step(). Tested on x86. Test-case: #include <stdio.h> #include <unistd.h> #include <signal.h> #include <sys/ptrace.h> #include <sys/wait.h> #include <assert.h> int main(void) { int pid, status; if (!(pid = fork())) { assert(ptrace(PTRACE_TRACEME) == 0); kill(getpid(), SIGSTOP); if (!fork()) { /* kernel bug: this child will be killed by SIGTRAP */ printf("Hello world\n"); return 43; } wait(&status); return WEXITSTATUS(status); } for (;;) { assert(pid == wait(&status)); if (WIFEXITED(status)) break; assert(ptrace(PTRACE_SINGLESTEP, pid, 0,0) == 0); } assert(WEXITSTATUS(status) == 43); return 0; } Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:08 -08:00
KAMEZAWA Hiroyuki	569b846df5	memcg: coalesce uncharge during unmap/truncate In massive parallel enviroment, res_counter can be a performance bottleneck. One strong techinque to reduce lock contention is reducing calls by coalescing some amount of calls into one. Considering charge/uncharge chatacteristic, - charge is done one by one via demand-paging. - uncharge is done by - in chunk at munmap, truncate, exit, execve... - one by one via vmscan/paging. It seems we have a chance to coalesce uncharges for improving scalability at unmap/truncation. This patch is a for coalescing uncharge. For avoiding scattering memcg's structure to functions under /mm, this patch adds memcg batch uncharge information to the task. A reason for per-task batching is for making use of caller's context information. We do batched uncharge (deleyed uncharge) when truncation/unmap occurs but do direct uncharge when uncharge is called by memory reclaim (vmscan.c). The degree of coalescing depends on callers - at invalidate/trucate... pagevec size - at unmap ....ZAP_BLOCK_SIZE (memory itself will be freed in this degree.) Then, we'll not coalescing too much. On x86-64 8cpu server, I tested overheads of memcg at page fault by running a program which does map/fault/unmap in a loop. Running a task per a cpu by taskset and see sum of the number of page faults in 60secs. [without memcg config] 40156968 page-faults # 0.085 M/sec ( +- 0.046% ) 27.67 cache-miss/faults [root cgroup] 36659599 page-faults # 0.077 M/sec ( +- 0.247% ) 31.58 miss/faults [in a child cgroup] 18444157 page-faults # 0.039 M/sec ( +- 0.133% ) 69.96 miss/faults [child with this patch] 27133719 page-faults # 0.057 M/sec ( +- 0.155% ) 47.16 miss/faults We can see some amounts of improvement. (root cgroup doesn't affected by this patch) Another patch for "charge" will follow this and above will be improved more. Changelog(since 2009/10/02): - renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes) - some clean up and commentary/description updates. - added initialize code to copy_process(). (possible bug fix) Changelog(old): - fixed !CONFIG_MEM_CGROUP case. - rebased onto the latest mmotm + softlimit fix patches. - unified patch for callers - added commetns. - make ->do_batch as bool. - removed css_get() at el. We don't need it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:07 -08:00
Alexey Dobriyan	28dfef8feb	const: constify remaining pipe_buf_operations Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:05 -08:00
Barry Song	f065f41f48	timecompare: fix half-Y2K38 problem in timecompare_update while calculating offset ktime will overflow from 03:14:07 UTC on Tuesday, 19 January 2038, ktime_add() in timecompare_update() will overflow a half earlier. As a result, wrong offset will be gotten, then cause some strange problems. Signed-off-by: Barry Song <21cnbao@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Patrick Ohly <patrick.ohly@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:19:57 -08:00
Peter Zijlstra	f13c12c634	perf_events: Fix perf_event_attr layout The miss-alignment of bp_addr created a 32bit hole, causing different structure packings on 32 and 64 bit machines. Fix that by moving __reserve_2 into that hole. Further, remove the useless struct and redundant __bp_reserve muck. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <1260902591.8023.781.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-15 20:12:20 +01:00
Linus Torvalds	8f0ddf91f2	Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits) clockevents: Convert to raw_spinlock clockevents: Make tick_device_lock static debugobjects: Convert to raw_spinlocks perf_event: Convert to raw_spinlock hrtimers: Convert to raw_spinlocks genirq: Convert irq_desc.lock to raw_spinlock smp: Convert smplocks to raw_spinlocks rtmutes: Convert rtmutex.lock to raw_spinlock sched: Convert pi_lock to raw_spinlock sched: Convert cpupri lock to raw_spinlock sched: Convert rt_runtime_lock to raw_spinlock sched: Convert rq->lock to raw_spinlock plist: Make plist debugging raw_spinlock aware bkl: Fixup core_lock fallout locking: Cleanup the name space completely locking: Further name space cleanups alpha: Fix fallout from locking changes locking: Implement new raw_spinlock locking: Convert raw_rwlock functions to arch_rwlock locking: Convert raw_rwlock to arch_rwlock ...	2009-12-15 09:02:01 -08:00
André Goddard Rosa	e7d2860b69	tree-wide: convert open calls to remove spaces to skip_spaces() lib function Makes use of skip_spaces() defined in lib/string.c for removing leading spaces from strings all over the tree. It decreases lib.a code size by 47 bytes and reuses the function tree-wide: text data bss dec hex filename 64688 584 592 65864 10148 (TOTALS-BEFORE) 64641 584 592 65817 10119 (TOTALS-AFTER) Also, while at it, if we see (str && isspace(str)), we can be sure to remove the first condition (str) as the second one (isspace(str)) also evaluates to 0 whenever str == 0, making it redundant. In other words, "a char equals zero is never a space". Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below, and found occurrences of this pattern on 3 more files: drivers/leds/led-class.c drivers/leds/ledtrig-timer.c drivers/video/output.c @@ expression str; @@ ( // ignore skip_spaces cases while (str && isspace(str)) { $str++;\\|++str;$ } \| - str && isspace(*str) ) Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Cc: Julia Lawall <julia@diku.dk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Neil Brown <neilb@suse.de> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: David Howells <dhowells@redhat.com> Cc: <linux-ext4@vger.kernel.org> Cc: Samuel Ortiz <samuel@sortiz.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:32 -08:00
Bernhard Walle	5ada918b82	vt: introduce and use vt_kmsg_redirect() function The kernel offers with TIOCL_GETKMSGREDIRECT ioctl() the possibility to redirect the kernel messages to a specific console. However, since it's not possible to switch to the kernel message console after a panic(), it would be nice if the kernel would print the panic message on the current console. This patch series adds a new interface to access the global kmsg_redirect variable by a function to be able to use it in code where CONFIG_VT_CONSOLE is not set (kernel/panic.c). This patch: Instead of using and exporting a global value kmsg_redirect, introduce a function vt_kmsg_redirect() that both can set and return the console where messages are printed. Change all users of kmsg_redirect (the VT code itself and kernel/power.c) to the new interface. The main advantage is that vt_kmsg_redirect() can also be used when CONFIG_VT_CONSOLE is not set. Signed-off-by: Bernhard Walle <bernhard@bwalle.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:28 -08:00
H Hartley Sweeten	dfc6a736d4	kernel/sys.c: fix "warning: do-while statement is not a compound statement" noise do_each_thread/while_each_thread wrap a block of code that is in this format: for (...) do ... while If curly braces do not surround the inner loop the following warning is generated by sparse: warning: do-while statement is not a compound statement Fix the warning by adding the braces. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:26 -08:00
Xiao Guangrong	c0f68c2fab	generic-ipi: cleanup for generic_smp_call_function_interrupt() Use smp_processor_id() instead of get_cpu() and put_cpu() in generic_smp_call_function_interrupt(), It's no need to disable preempt, because we must call generic_smp_call_function_interrupt() with interrupts disabled. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:25 -08:00
Amerigo Wang	70da2340fb	'sysctl_max_map_count' should be non-negative Jan Engelhardt reported we have this problem: setting max_map_count to a value large enough results in programs dying at first try. This is on 2.6.31.6: 15:59 borg:/proc/sys/vm # echo $[1<<31-1] >max_map_count 15:59 borg:/proc/sys/vm # cat max_map_count 1073741824 15:59 borg:/proc/sys/vm # echo $[1<<31] >max_map_count 15:59 borg:/proc/sys/vm # cat max_map_count Killed This is because we have a chance to make 'max_map_count' negative. but it's meaningless. Make it only accept non-negative values. Reported-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: James Morris <jmorris@namei.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:23 -08:00
Lee Schermerhorn	06808b0827	hugetlb: derive huge pages nodes allowed from task mempolicy This patch derives a "nodes_allowed" node mask from the numa mempolicy of the task modifying the number of persistent huge pages to control the allocation, freeing and adjusting of surplus huge pages when the pool page count is modified via the new sysctl or sysfs attribute "nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows: * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer is produced. This will cause the hugetlb subsystem to use node_online_map as the "nodes_allowed". This preserves the behavior before this patch. * For "preferred" mempolicy, including explicit local allocation, a nodemask with the single preferred node will be produced. "local" policy will NOT track any internode migrations of the task adjusting nr_hugepages. * For "bind" and "interleave" policy, the mempolicy's nodemask will be used. * Other than to inform the construction of the nodes_allowed node mask, the actual mempolicy mode is ignored. That is, all modes behave like interleave over the resulting nodes_allowed mask with no "fallback". See the updated documentation [next patch] for more information about the implications of this patch. Examples: Starting with: Node 0 HugePages_Total: 0 Node 1 HugePages_Total: 0 Node 2 HugePages_Total: 0 Node 3 HugePages_Total: 0 Default behavior [with or without this patch] balances persistent hugepage allocation across nodes [with sufficient contiguous memory]: sysctl vm.nr_hugepages[_mempolicy]=32 yields: Node 0 HugePages_Total: 8 Node 1 HugePages_Total: 8 Node 2 HugePages_Total: 8 Node 3 HugePages_Total: 8 Of course, we only have nr_hugepages_mempolicy with the patch, but with default mempolicy, nr_hugepages_mempolicy behaves the same as nr_hugepages. Applying mempolicy--e.g., with numactl [using '-m' a.k.a. '--membind' because it allows multiple nodes to be specified and it's easy to type]--we can allocate huge pages on individual nodes or sets of nodes. So, starting from the condition above, with 8 huge pages per node, add 8 more to node 2 using: numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40 This yields: Node 0 HugePages_Total: 8 Node 1 HugePages_Total: 8 Node 2 HugePages_Total: 16 Node 3 HugePages_Total: 8 The incremental 8 huge pages were restricted to node 2 by the specified mempolicy. Similarly, we can use mempolicy to free persistent huge pages from specified nodes: numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32 yields: Node 0 HugePages_Total: 4 Node 1 HugePages_Total: 4 Node 2 HugePages_Total: 16 Node 3 HugePages_Total: 8 The 8 huge pages freed were balanced over nodes 0 and 1. [rientjes@google.com: accomodate reworked NODEMASK_ALLOC] Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:12 -08:00
Alexey Dobriyan	4b731d50ff	bsdacct: fix uid/gid misreporting commit `d8e180dcd5` "bsdacct: switch credentials for writing to the accounting file" introduced credential switching during final acct data collecting. However, uid/gid pair continued to be collected from current which became credentials of who created acct file, not who exits. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14676 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: Juho K. Juopperi <jkj@kapsi.fi> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:10 -08:00
Paul Mackerras	0f624e7e56	perf_event: Fix incorrect range check on cpu number It is quite legitimate for CPUs to be numbered sparsely, meaning that it possible for an online CPU to have a number which is greater than the total count of possible CPUs. Currently find_get_context() has a sanity check on the cpu number where it checks it against num_possible_cpus(). This test can fail for a legitimate cpu number if the cpu_possible_mask is sparsely populated. This fixes the problem by checking the CPU number against nr_cpumask_bits instead, since that is the appropriate check to ensure that the cpu number is same to pass to cpu_isset() subsequently. Reported-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Tested-by: Michael Neuling <mikey@neuling.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@kernel.org> LKML-Reference: <20091215084032.GA18661@brick.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-15 13:09:55 +01:00
David Miller	b9f8fcd55b	sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK Relax stable-sched-clock architectures to not save/disable/restore hardirqs in cpu_clock(). The background is that I was trying to resolve a sparc64 perf issue when I discovered this problem. On sparc64 I implement pseudo NMIs by simply running the kernel at IRQ level 14 when local_irq_disable() is called, this allows performance counter events to still come in at IRQ level 15. This doesn't work if any code in an NMI handler does local_irq_save() or local_irq_disable() since the "disable" will kick us back to cpu IRQ level 14 thus letting NMIs back in and we recurse. The only path which that does that in the perf event IRQ handling path is the code supporting frequency based events. It uses cpu_clock(). cpu_clock() simply invokes sched_clock() with IRQs disabled. And that's a fundamental bug all on it's own, particularly for the HAVE_UNSTABLE_SCHED_CLOCK case. NMIs can thus get into the sched_clock() code interrupting the local IRQ disable code sections of it. Furthermore, for the not-HAVE_UNSTABLE_SCHED_CLOCK case, the IRQ disabling done by cpu_clock() is just pure overhead and completely unnecessary. So the core problem is that sched_clock() is not NMI safe, but we are invoking it from NMI contexts in the perf events code (via cpu_clock()). A less important issue is the overhead of IRQ disabling when it isn't necessary in cpu_clock(). CONFIG_HAVE_UNSTABLE_SCHED_CLOCK architectures are not affected by this patch. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091213.182502.215092085.davem@davemloft.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-15 09:04:36 +01:00
Steven Rostedt	e36c54582c	tracing: Fix return of trace_dump_stack() The trace_dump_stack() returned a value for a void function. Also, added the missing stub for trace_dump_stack() when tracing is not configured. Reported-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <20091214162713.GA31060@elte.hu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-15 08:36:11 +01:00
Rusty Russell	d4703aefdb	module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y powerpc applies relocations to the kcrctab. They're absolute symbols, but it's not completely unreasonable: other archs may too, but the relocation is often 0. http://lists.ozlabs.org/pipermail/linuxppc-dev/2009-November/077972.html Inspired-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Tested-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Paul Mackerras <paulus@samba.org>	2009-12-15 16:28:34 +10:30
Thomas Gleixner	b5f91da0a6	clockevents: Convert to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:34 +01:00
Thomas Gleixner	d192c47f25	clockevents: Make tick_device_lock static Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:34 +01:00
Thomas Gleixner	e625cce1b7	perf_event: Convert to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:34 +01:00
Thomas Gleixner	ecb49d1a63	hrtimers: Convert to raw_spinlocks Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:34 +01:00
Thomas Gleixner	239007b844	genirq: Convert irq_desc.lock to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	9f5a5621e7	smp: Convert smplocks to raw_spinlocks Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	d209d74d52	rtmutes: Convert rtmutex.lock to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	1d61548254	sched: Convert pi_lock to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	fe841226bd	sched: Convert cpupri lock to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	0986b11b12	sched: Convert rt_runtime_lock to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	05fa785cf8	sched: Convert rq->lock to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	a26724591e	plist: Make plist debugging raw_spinlock aware plists are used with spinlocks and raw_spinlocks. Change the plist debugging to handle both types. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	9c1721aa49	locking: Cleanup the name space completely Make the name space hierarchy of locking functions consistent: raw_spin* -> _raw_spin* -> __raw_spin* No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	9828ea9d75	locking: Further name space cleanups The name space hierarchy for the internal lock functions is now a bit backwards. raw_spin* functions map to _spin* which use __spin, while we would like to have _raw_spin and __raw_spin. _raw_spin is already used by lock debugging, so rename those funtions to do_raw_spin* to free up the _raw_spin* name space. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	c2f21ce2e3	locking: Implement new raw_spinlock Now that the raw_spin name space is freed up, we can implement raw_spinlock and the related functions which are used to annotate the locks which are not converted to sleeping spinlocks in preempt-rt. A side effect is that only such locks can be used with the low level lock fsunctions which circumvent lockdep. For !rt spin_* functions are mapped to the raw_spin* implementations. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:32 +01:00
Thomas Gleixner	0199c4e68d	locking: Convert __raw_spin* functions to arch_spin* Name space cleanup. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: linux-arch@vger.kernel.org	2009-12-14 23:55:32 +01:00
Thomas Gleixner	edc35bd72e	locking: Rename __RAW_SPIN_LOCK_UNLOCKED to __ARCH_SPIN_LOCK_UNLOCKED Further name space cleanup. No functional change Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: linux-arch@vger.kernel.org	2009-12-14 23:55:32 +01:00
Thomas Gleixner	445c89514b	locking: Convert raw_spinlock to arch_spinlock The raw_spin* namespace was taken by lockdep for the architecture specific implementations. raw_spin_* would be the ideal name space for the spinlocks which are not converted to sleeping locks in preempt-rt. Linus suggested to convert the raw_ to arch_ locks and cleanup the name space instead of using an artifical name like core_spin, atomic_spin or whatever No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: linux-arch@vger.kernel.org	2009-12-14 23:55:32 +01:00
Thomas Gleixner	b7b40ade58	locking: Reorder functions in spinlock.c Separate spin_lock and rw_lock functions. Preempt-RT needs to exclude the rw_lock functions from being compiled. The reordering allows to do that with a single #ifdef. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:32 +01:00
Linus Torvalds	d0316554d3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits) m68k: rename global variable vmalloc_end to m68k_vmalloc_end percpu: add missing per_cpu_ptr_to_phys() definition for UP percpu: Fix kdump failure if booted with percpu_alloc=page percpu: make misc percpu symbols unique percpu: make percpu symbols in ia64 unique percpu: make percpu symbols in powerpc unique percpu: make percpu symbols in x86 unique percpu: make percpu symbols in xen unique percpu: make percpu symbols in cpufreq unique percpu: make percpu symbols in oprofile unique percpu: make percpu symbols in tracer unique percpu: make percpu symbols under kernel/ and mm/ unique percpu: remove some sparse warnings percpu: make alloc_percpu() handle array types vmalloc: fix use of non-existent percpu variable in put_cpu_var() this_cpu: Use this_cpu_xx in trace_functions_graph.c this_cpu: Use this_cpu_xx for ftrace this_cpu: Use this_cpu_xx in nmi handling this_cpu: Use this_cpu operations in RCU this_cpu: Use this_cpu ops for VM statistics ... Fix up trivial (famous last words) global per-cpu naming conflicts in arch/x86/kvm/svm.c mm/slab.c	2009-12-14 09:58:24 -08:00
Ingo Molnar	0087aabd6a	Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent	2009-12-14 17:12:37 +01:00
Thomas Gleixner	1a551ae715	sched: Use rcu in sched_get_rr_param() read_lock(&tasklist_lock) does not protect sys_sched_get_rr_param() against a concurrent update of the policy or scheduler parameters as do_sched_scheduler() does not take the tasklist_lock. The access to task->sched_class->get_rr_interval is protected by task_rq_lock(task). Use rcu_read_lock() to protect find_task_by_vpid() and prevent the task struct from going away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091209100706.862897167@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 17:11:35 +01:00
Thomas Gleixner	23f5d14251	sched: Use rcu in sched_get/set_affinity() tasklist_lock is held read locked to protect the find_task_by_vpid() call and to prevent the task going away. sched_setaffinity acquires a task struct ref and drops tasklist lock right away. The access to the cpus_allowed mask is protected by rq->lock. rcu_read_lock() provides the same protection here. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091209100706.789059966@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 17:11:35 +01:00
Thomas Gleixner	5fe85be081	sched: Use rcu in sys_sched_getscheduler/sys_sched_getparam() read_lock(&tasklist_lock) does not protect sys_sched_getscheduler and sys_sched_getparam() against a concurrent update of the policy or scheduler parameters as do_sched_setscheduler() does not take the tasklist_lock. The accessed integers can be retrieved w/o locking and are snapshots anyway. Using rcu_read_lock() to protect find_task_by_vpid() and prevent the task struct from going away is not changing the above situation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091209100706.753790977@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 17:11:34 +01:00
Ingo Molnar	cc0104e877	Merge branch 'linus' into tracing/urgent Conflicts: kernel/trace/trace_kprobe.c Merge reason: resolve the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 09:16:49 +01:00
Li Zefan	16620e0f19	ksym_tracer: Fix bad cast Fix this warning: kernel/trace/trace_ksym.c: In function 'ksym_trace_filter_read': kernel/trace/trace_ksym.c:239: warning: cast to pointer from integer of different size Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: "K.Prasad" <prasad@linux.vnet.ibm.com> LKML-Reference: <4B1DC578.9020909@cn.fujitsu.com> [remove the strstrip fix as tglx already fixed that] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:46:54 +01:00
Li Zefan	472bbe02c9	tracing/power: Remove two exports trace_power_start and trace_power_end are used in arch/x86/kernel/power.c, and this file can't be compiled as a module, so these two tracepoints don't need to be exported. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B1DC55F.7060305@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:28 +01:00
Li Zefan	e00bf2ec60	tracing: Change event->profile_count to be int type Like total_profile_count, struct ftrace_event_call::profile_count is protected by event_mutex, so it doesn't need to be atomic_t. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <4B1DC549.5010705@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:28 +01:00
Li Zefan	8d18eaaff5	tracing: Simplify trace_option_write() - remove duplicate code inside trace_options_write() - extract duplicate code in trace_options_write() and set_tracer_option() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B1DC532.9010802@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:28 +01:00
Li Zefan	2cbafd68b8	tracing: Remove useless trace option Since commit `4d9493c90f` ("ftrace: remove add-hoc code"), option "sched-tree" has become useless. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B1DC50A.7040402@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:27 +01:00
Li Zefan	13f16d2091	tracing: Use seq file for trace_clock The buffer for the output is as small as 64 bytes, so it'll overflow if we add more clock type. Use seq file instead. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B1DC4FB.5030407@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:27 +01:00
Li Zefan	fdb372ed4c	tracing: Use seq file for trace_options Code simplification for reading trace_options. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-reference: <4B1DC4EF.3090106@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:27 +01:00
Li Zefan	91baf6285b	function-graph: Allow writing the same val to set_graph_function # echo 'do_open' > set_graph_function # echo 'do_open' >> set_graph_function bash: echo: write error: Invalid argument Make it valid to write the same value to set_graph_function, which is consistent with set_ftrace_filter interface. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-reference: <4B1DC4E1.1060303@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:26 +01:00
Li Zefan	313254a940	ftrace: Call trace_parser_clear() properly I found a weird behavior: # echo 'fuse:*' > set_ftrace_filter bash: echo: write error: Invalid argument # cat set_ftrace_filter fuse_dev_fasync fuse_dev_poll fuse_copy_do We should call trace_parser_clear() no matter ftrace_process_regex() returns 0 or -errno, otherwise we will actually take the unaccepted records from ftrace_regex_release(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B1DC4D2.3000406@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:26 +01:00
Li Zefan	311d16da57	ftrace: Return EINVAL when writing invalid val to set_ftrace_filter Currently it doesn't warn user on invald value: # echo nonexist_symbol > set_ftrace_filter or: # echo 'nonexist_symbol:mod:fuse' > set_ftrace_filter Better make it return failure. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B1DC4BF.2070003@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:25 +01:00
Li Zefan	3b8e427381	tracing: Move a printk out of ftrace_raw_reg_event_foo() Move the printk from each ftrace_raw_reg_event_foo() to its caller ftrace_event_enable_disable(). This avoids each regfunc trace event callbacks to handle a same error report that can be carried from the caller. See how much space this saves: text data bss dec hex filename 5345151 1961864 7103260 14410275 dbe223 vmlinux.o.old 5331487 1961864 7103260 14396611 dbacc3 vmlinux.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <4B1DC4AC.802@cn.fujitsu.com> [start cmdline record before calling regfunc to avoid lost window of pid to comm resolution] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:25 +01:00
Li Zefan	614a71a26b	tracing: Pull up calls to trace_define_common_fields() Call trace_define_common_fields() in event_create_dir() only. This avoids trace events to handle it from their define_fields callbacks and shrinks the kernel code size: text data bss dec hex filename 5346802 1961864 7103260 14411926 dbe896 vmlinux.o.old 5345151 1961864 7103260 14410275 dbe223 vmlinux.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jason Baron <jbaron@redhat.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> LKML-Reference: <4B1DC49C.8000107@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:34:23 +01:00
Li Zefan	87d9b4e1c5	tracing: Extract duplicate ftrace_raw_init_event_foo() Use a generic trace_event_raw_init() function for all event's raw_init callbacks (but kprobes) instead of defining the same version for each of these. This shrinks the kernel code: text data bss dec hex filename 5355293 1961928 7103260 14420481 dc0a01 vmlinux.o.old 5346802 1961864 7103260 14411926 dbe896 vmlinux.o raw_init can't be removed, because ftrace events and kprobe events use different raw_init callbacks. Though it's possible to totally remove raw_init, I choose to leave it as it is for now. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> LKML-Reference: <4B1DC48C.7080603@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:34:23 +01:00
Joe Perches	663997d417	sched: Use pr_fmt() and pr_<level>() - Convert printk(KERN_<level> to pr_<level> (not KERN_DEBUG) - Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt - Coalesce long format strings - Add missing \n to "ERROR: !SD_LOAD_BALANCE domain has parent" Signed-off-by: Joe Perches <joe@perches.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1260655047.2637.7.camel@Joe-Laptop.home> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-13 08:13:55 +01:00
Rafael J. Wysocki	7539a3b3d1	sched: Make wakeup side and atomic variants of completion API irq safe Alan Stern noticed that all the wakeup side (and atomic) variants of the completion APIs should be irq safe, but the newly introduced completion_done() and try_wait_for_completion() aren't. The use of the irq unsafe variants in IRQ contexts can cause crashes/hangs. Fix the problem by making them use spin_lock_irqsave() and spin_lock_irqrestore(). Reported-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Zhang Rui <rui.zhang@intel.com> Cc: pm list <linux-pm@lists.linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Chinner <david@fromorbit.com> Cc: Lachlan McIlroy <lachlan@sgi.com> LKML-Reference: <200912130007.30541.rjw@sisk.pl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-13 08:12:46 +01:00
Linus Torvalds	702a7c7609	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (21 commits) sched: Remove forced2_migrations stats sched: Fix memory leak in two error corner cases sched: Fix build warning in get_update_sysctl_factor() sched: Update normalized values on user updates via proc sched: Make tunable scaling style configurable sched: Fix missing sched tunable recalculation on cpu add/remove sched: Fix task priority bug sched: cgroup: Implement different treatment for idle shares sched: Remove unnecessary RCU exclusion sched: Discard some old bits sched: Clean up check_preempt_wakeup() sched: Move update_curr() in check_preempt_wakeup() to avoid redundant call sched: Sanitize fork() handling sched: Clean up ttwu() rq locking sched: Remove rq->clock coupling from set_task_cpu() sched: Consolidate select_task_rq() callers sched: Remove sysctl.sched_features sched: Protect sched_rr_get_param() access to task->sched_class sched: Protect task->cpus_allowed access in sched_getaffinity() sched: Fix balance vs hotplug race ... Fixed up conflicts in kernel/sysctl.c (due to sysctl cleanup)	2009-12-12 11:34:10 -08:00
Sam Ravnborg	273b281fa2	kbuild: move utsrelease.h to include/generated Fix up all users of utsrelease.h Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Michal Marek <mmarek@suse.cz>	2009-12-12 13:08:15 +01:00
Sam Ravnborg	01fc0ac198	kbuild: move bounds.h to include/generated Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Michal Marek <mmarek@suse.cz>	2009-12-12 13:08:14 +01:00
Linus Torvalds	3070f27d6e	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: itimer: Fix the itimer trace print format hrtimer: move timer stats helper functions to hrtimer.c hrtimer: Tune hrtimer_interrupt hang logic	2009-12-11 20:49:09 -08:00
Linus Torvalds	1e57c2186f	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: Avoid out of bounds array reference in save_trace() futex: Take mmap_sem for get_user_pages in fault_in_user_writeable lockstat: Add usage info to Documentation/lockstat.txt lockstat: Fix min, max times in /proc/lock_stats	2009-12-11 20:48:21 -08:00
Linus Torvalds	df7147b3c3	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Remove comparing of NULL to va_list in trace_array_vprintk() tracing: Fix function graph trace_pipe to properly display failed entries tracing: Add full state to trace_seq tracing: Buffer the output of seq_file in case of filled buffer tracing: Only call pipe_close if pipe_close is defined tracing: Add pipe_close interface	2009-12-11 20:47:44 -08:00
Linus Torvalds	6f696eb17b	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (57 commits) x86, perf events: Check if we have APIC enabled perf_event: Fix variable initialization in other codepaths perf kmem: Fix unused argument build warning perf symbols: perf_header__read_build_ids() offset'n'size should be u64 perf symbols: dsos__read_build_ids() should read both user and kernel buildids perf tools: Align long options which have no short forms perf kmem: Show usage if no option is specified sched: Mark sched_clock() as notrace perf sched: Add max delay time snapshot perf tools: Correct size given to memset perf_event: Fix perf_swevent_hrtimer() variable initialization perf sched: Fix for getting task's execution time tracing/kprobes: Fix field creation's bad error handling perf_event: Cleanup for cpu_clock_perf_event_update() perf_event: Allocate children's perf_event_ctxp at the right time perf_event: Clean up __perf_event_init_context() hw-breakpoints: Modify breakpoints without unregistering them perf probe: Update perf-probe document perf probe: Support --del option trace-kprobe: Support delete probe syntax ...	2009-12-11 20:47:30 -08:00
Linus Torvalds	0f4974c439	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (58 commits) tty: split the lock up a bit further tty: Move the leader test in disassociate tty: Push the bkl down a bit in the hangup code tty: Push the lock down further into the ldisc code tty: push the BKL down into the handlers a bit tty: moxa: split open lock tty: moxa: Kill the use of lock_kernel tty: moxa: Fix modem op locking tty: moxa: Kill off the throttle method tty: moxa: Locking clean up tty: moxa: rework the locking a bit tty: moxa: Use more tty_port ops tty: isicom: fix deadlock on shutdown tty: mxser: Use the new locking rules to fix setserial properly tty: mxser: use the tty_port_open method tty: isicom: sort out the board init logic tty: isicom: switch to the new tty_port_open helper tty: tty_port: Add a kref object to the tty port tty: istallion: tty port open/close methods tty: stallion: Convert to the tty_port_open/close methods ...	2009-12-11 15:34:40 -08:00
Linus Torvalds	880188b243	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdb: Always process the whole breakpoint list on activate or deactivate kgdb: continue and warn on signal passing from gdb kgdb,x86: do not set kgdb_single_step on x86 kgdb: allow for cpu switch when single stepping kgdb,i386: Fix corner case access to ss with NMI watch dog exception kgdb: Replace strstr() by strchr() for single-character needles kgdbts: Read buffer overflow kgdb: Read buffer overflow kgdb,x86: remove redundant test	2009-12-11 15:19:56 -08:00
Alan Cox	5ec93d1154	tty: Move the leader test in disassociate There are two call points, both want to check that tty->signal->leader is set. Move the test into disassociate_ctty() as that will make locking changes easier in a bit Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-12-11 15:18:08 -08:00
Linus Torvalds	11bd04f6f3	Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (109 commits) PCI: fix coding style issue in pci_save_state() PCI: add pci_request_acs PCI: fix BUG_ON triggered by logical PCIe root port removal PCI: remove ifdefed pci_cleanup_aer_correct_error_status PCI: unconditionally clear AER uncorr status register during cleanup x86/PCI: claim SR-IOV BARs in pcibios_allocate_resource PCI: portdrv: remove redundant definitions PCI: portdrv: remove unnecessary struct pcie_port_data PCI: portdrv: minor cleanup for pcie_port_device_register PCI: portdrv: add missing irq cleanup PCI: portdrv: enable device before irq initialization PCI: portdrv: cleanup service irqs initialization PCI: portdrv: check capabilities first PCI: portdrv: move PME capability check PCI: portdrv: remove redundant pcie type calculation PCI: portdrv: cleanup pcie_device registration PCI: portdrv: remove redundant pcie_port_device_probe PCI: Always set prefetchable base/limit upper32 registers PCI: read-modify-write the pcie device control register when initiating pcie flr PCI: show dma_mask bits in /sys ... Fixed up conflicts in: arch/x86/kernel/amd_iommu_init.c drivers/pci/dmar.c drivers/pci/hotplug/acpiphp_glue.c	2009-12-11 12:18:16 -08:00
Steven Rostedt	cc51a0fca6	tracing: Add stack trace to irqsoff tracer The irqsoff and friends tracers help in finding causes of latency in the kernel. The also work with the function tracer to show what was happening when interrupts or preemption are disabled. But the function tracer has a bit of an overhead and can cause exagerated readings. Currently, when tracing with /proc/sys/kernel/ftrace_enabled = 0, where the function tracer is disabled, the information that is provided can end up being useless. For example, a 2 and a half millisecond latency only showed: # tracer: preemptirqsoff # # preemptirqsoff latency trace v1.1.5 on 2.6.32 # -------------------------------------------------------------------- # latency: 2463 us, #4/4, CPU#2 \| (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) # ----------------- # \| task: -4242 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # => started at: _spin_lock_irqsave # => ended at: remove_wait_queue # # # _------=> CPU# # / _-----=> irqs-off # \| / _----=> need-resched # \|\| / _---=> hardirq/softirq # \|\|\| / _--=> preempt-depth # \|\|\|\| /_--=> lock-depth # \|\|\|\|\|/ delay # cmd pid \|\|\|\|\|\| time \| caller # \ / \|\|\|\|\|\| \ \| / hackbenc-4242 2d.... 0us!: trace_hardirqs_off <-_spin_lock_irqsave hackbenc-4242 2...1. 2463us+: _spin_unlock_irqrestore <-remove_wait_queue hackbenc-4242 2...1. 2466us : trace_preempt_on <-remove_wait_queue The above lets us know that hackbench with pid 2463 grabbed a spin lock somewhere and enabled preemption at remove_wait_queue. This helps a little but where this actually happened is not informative. This patch adds the stack dump to the end of the irqsoff tracer. This provides the following output: hackbenc-4242 2d.... 0us!: trace_hardirqs_off <-_spin_lock_irqsave hackbenc-4242 2...1. 2463us+: _spin_unlock_irqrestore <-remove_wait_queue hackbenc-4242 2...1. 2466us : trace_preempt_on <-remove_wait_queue hackbenc-4242 2...1. 2467us : <stack trace> => sub_preempt_count => _spin_unlock_irqrestore => remove_wait_queue => free_poll_entry => poll_freewait => do_sys_poll => sys_poll => system_call_fastpath Now we see that the culprit of this latency was the free_poll_entry code. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-11 13:19:51 -05:00
Steven Rostedt	03889384ce	tracing: Add trace_dump_stack() I've been asked a few times about how to find out what is calling some location in the kernel. One way is to use dynamic function tracing and implement the func_stack_trace. But this only finds out who is calling a particular function. It does not tell you who is calling that function and entering a specific if conditional. I have myself implemented a quick version of trace_dump_stack() for this purpose a few times, and just needed it now. This is when I realized that this would be a good tool to have in the kernel like trace_printk(). Using trace_dump_stack() is similar to dump_stack() except that it writes to the trace buffer instead and can be used in critical locations. For example: @@ -5485,8 +5485,12 @@ need_resched_nonpreemptible: if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) { if (unlikely(signal_pending_state(prev->state, prev))) prev->state = TASK_RUNNING; - else + else { deactivate_task(rq, prev, 1); + trace_printk("Deactivating task %s:%d\n", + prev->comm, prev->pid); + trace_dump_stack(); + } switch_count = &prev->nvcsw; } Produces: <...>-3249 [001] 296.105269: schedule: Deactivating task ntpd:3249 <...>-3249 [001] 296.105270: <stack trace> => schedule => schedule_hrtimeout_range => poll_schedule_timeout => do_select => core_sys_select => sys_select => system_call_fastpath Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-11 10:38:47 -05:00
Jason Wessel	7f8b7ed6f8	kgdb: Always process the whole breakpoint list on activate or deactivate This patch fixes 2 edge cases in using kgdb in conjunction with gdb. 1) kgdb_deactivate_sw_breakpoints() should process the entire array of breakpoints. The failure to do so results in breakpoints that you cannot remove, because a break point can only be removed if its state flag is set to BP_SET. The easy way to duplicate this problem is to plant a break point in a kernel module and then unload the kernel module. 2) kgdb_activate_sw_breakpoints() should process the entire array of breakpoints. The failure to do so results in missed breakpoints when a breakpoint cannot be activated. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2009-12-11 08:43:20 -06:00
Jason Wessel	d625e9c0d7	kgdb: continue and warn on signal passing from gdb On some architectures for the segv trap, gdb wants to pass the signal back on continue. For kgdb this is not the default behavior, because it can cause the kernel to crash if you arbitrarily pass back a exception outside of kgdb. Instead of causing instability, pass a message back to gdb about the supported kgdb signal passing and execute a standard kgdb continue operation. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2009-12-11 08:43:19 -06:00
Jason Wessel	028e7b1759	kgdb: allow for cpu switch when single stepping The kgdb core should not assume that a single step operation of a kernel thread will complete on the same CPU. The single step flag is set at the "thread" level and it is possible in a multi cpu system that a kernel thread can get scheduled on another cpu the next time it is run. As a further safety net in case a slave cpu is hung, the debug master cpu will try 100 times before giving up and assuming control of the slave cpus is no longer possible. It is more useful to be able to get some information out of kgdb instead of spinning forever. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2009-12-11 08:43:17 -06:00
Jason Wessel	84667d4849	kgdb: Read buffer overflow Roel Kluin reported an error found with Parfait. Where we want to ensure that that kgdb_info[-1] never gets accessed. Also check to ensure any negative tid does not exceed the size of the shadow CPU array, else report critical debug context because it is an internal kgdb failure. Reported-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2009-12-11 08:43:13 -06:00
Thomas Gleixner	bb6eddf767	clockevents: Prevent clockevent_devices list corruption on cpu hotplug Xiaotian Feng triggered a list corruption in the clock events list on CPU hotplug and debugged the root cause. If a CPU registers more than one per cpu clock event device, then only the active clock event device is removed on CPU_DEAD. The unused devices are kept in the clock events device list. On CPU up the clock event devices are registered again, which means that we list_add an already enqueued list_head. That results in list corruption. Resolve this by removing all devices which are associated to the dead CPU on CPU_DEAD. Reported-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Xiaotian Feng <dfeng@redhat.com> Cc: stable@kernel.org	2009-12-11 10:28:08 +01:00
Steven Rostedt	dd7f594357	ring-buffer: Move resize integrity check under reader lock While using an application that does splice on the ftrace ring buffer at start up, I triggered an integrity check failure. Looking into this, I discovered that resizing the buffer performs an integrity check after the buffer is resized. This check unfortunately is preformed after it releases the reader lock. If a reader is reading the buffer it may cause the integrity check to trigger a false failure. This patch simply moves the integrity checker under the protection of the ring buffer reader lock. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-10 23:20:52 -05:00
Steven Rostedt	184210154b	ring-buffer: Use sync sched protection on ring buffer resizing There was a comment in the ring buffer code that says the calling layers should prevent tracing or reading of the ring buffer while resizing. I have discovered that the tracers do not honor this arrangement. This patch moves the disabling and synchronizing the ring buffer to a higher layer during resizing. This guarantees that no writes are occurring while the resize takes place. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-10 22:54:27 -05:00
Thomas Gleixner	d954fbf0ff	tracing: Fix wrong usage of strstrip in trace_ksyms strstrip returns a pointer to the first non space character, but the code in parse_ksym_trace_str() ignores that. strstrip is now must_check and therefor we get the correct warning: kernel/trace/trace_ksym.c:294: warning: ignoring return value of ‘strstrip’, declared with attribute warn_unused_result We are really not interested in leading whitespace here. Fix that and cleanup the dozen kfree() exit pathes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org>	2009-12-11 00:01:36 +01:00
Thomas Gleixner	d4581a239a	sys: Fix missing rcu protection for __task_cred() access commit `c69e8d9` (CRED: Use RCU to access another task's creds and to release a task's own creds) added non rcu_read_lock() protected access to task creds of the target task in set_prio_one(). The comment above the function says: * - the caller must hold the RCU read lock The calling code in sys_setpriority does read_lock(&tasklist_lock) but not rcu_read_lock(). This works only when CONFIG_TREE_PREEMPT_RCU=n. With CONFIG_TREE_PREEMPT_RCU=y the rcu_callbacks can run in the tick interrupt when they see no read side critical section. There is another instance of __task_cred() in sys_setpriority() itself which is equally unprotected. Wrap the whole code section into a rcu read side critical section to fix this quick and dirty. Will be revisited in course of the read_lock(&tasklist_lock) -> rcu crusade. Oleg noted further: This also fixes another bug here. find_task_by_vpid() is not safe without rcu_read_lock(). I do not mean it is not safe to use the result, just find_pid_ns() by itself is not safe. Usually tasklist gives enough protection, but if copy_process() fails it calls free_pid() lockless and does call_rcu(delayed_put_pid(). This means, without rcu lock find_pid_ns() can't scan the hash table safely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20091210004703.029784964@linutronix.de> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2009-12-10 23:04:11 +01:00
Thomas Gleixner	7cf7db8df0	signals: Fix more rcu assumptions 1) Remove the misleading comment in __sigqueue_alloc() which claims that holding a spinlock is equivalent to rcu_read_lock(). 2) Add a rcu_read_lock/unlock around the __task_cred() access in __sigqueue_alloc() This needs to be revisited to remove the remaining users of read_lock(&tasklist_lock) but that's outside the scope of this patch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20091210004703.269843657@linutronix.de>	2009-12-10 23:04:11 +01:00
Thomas Gleixner	14d8c9f3c0	signal: Fix racy access to __task_cred in kill_pid_info_as_uid() kill_pid_info_as_uid() accesses __task_cred() without being in a RCU read side critical section. tasklist_lock is not protecting that when CONFIG_TREE_PREEMPT_RCU=y. Convert the whole tasklist_lock section to rcu and use lock_task_sighand to prevent the exit race. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20091210004703.232302055@linutronix.de> Acked-by: Oleg Nesterov <oleg@redhat.com>	2009-12-10 23:04:11 +01:00
Ingo Molnar	b9889ed1dd	sched: Remove forced2_migrations stats This build warning: kernel/sched.c: In function 'set_task_cpu': kernel/sched.c:2070: warning: unused variable 'old_rq' Made me realize that the forced2_migrations stat looks pretty pointless (and a misnomer) - remove it. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-10 20:32:39 +01:00
Linus Torvalds	d71cb81af3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Add debugobjects support	2009-12-10 09:35:44 -08:00
Xiao Guangrong	5e855db5d8	perf_event: Fix variable initialization in other codepaths Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B20BAA6.7010609@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-10 17:23:02 +01:00
Phil Carmody	dfc12eb26a	sched: Fix memory leak in two error corner cases If the second in each of these pairs of allocations fails, then the first one will not be freed in the error route out. Found by a static code analysis tool. Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1260448177-28448-1-git-send-email-ext-phil.2.carmody@nokia.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-10 14:28:10 +01:00
Heiko Carstens	5f201907df	hrtimer: move timer stats helper functions to hrtimer.c There is no reason to make timer_stats_hrtimer_set_start_info and friends visible to the rest of the kernel. So move all of them to hrtimer.c. Also make timer_stats_hrtimer_set_start_info a static inline function so it gets inlined and we avoid another function call. Based on a patch by Thomas Gleixner. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> LKML-Reference: <20091210095629.GC4144@osiris.boeblingen.de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-12-10 13:08:11 +01:00
Thomas Gleixner	41d2e49493	hrtimer: Tune hrtimer_interrupt hang logic The hrtimer_interrupt hang logic adjusts min_delta_ns based on the execution time of the hrtimer callbacks. This is error-prone for virtual machines, where a guest vcpu can be scheduled out during the execution of the callbacks (and the callbacks themselves can do operations that translate to blocking operations in the hypervisor), which in can lead to large min_delta_ns rendering the system unusable. Replace the current heuristics with something more reliable. Allow the interrupt code to try 3 times to catch up with the lost time. If that fails use the total time spent in the interrupt handler to defer the next timer interrupt so the system can catch up with other things which got delayed. Limit that deferment to 100ms. The retry events and the maximum time spent in the interrupt handler are recorded and exposed via /proc/timer_list Inspired by a patch from Marcelo. Reported-by: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: kvm@vger.kernel.org	2009-12-10 13:08:11 +01:00
Mike Galbraith	4ca3ef71f5	sched: Fix build warning in get_update_sysctl_factor() Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <new-submission>	2009-12-10 09:34:50 +01:00
Luck, Tony	ea5b41f9d5	lockdep: Avoid out of bounds array reference in save_trace() ia64 found this the hard way (because we currently have a stub for save_stack_trace() that does nothing). But it would be a good idea to be cautious in case a real save_stack_trace() bailed out with an error before it set trace->nr_entries. Signed-off-by: Tony Luck <tony.luck@intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: luming.yu@intel.com LKML-Reference: <4b2024d085302c2a2@agluck-desktop.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-10 08:29:33 +01:00
Ingo Molnar	788d70dce0	Merge branch 'tip/tracing/core3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/core	2009-12-10 08:18:41 +01:00
Xiao Guangrong	21140f4d33	perf_event: Fix perf_swevent_hrtimer() variable initialization fix: [<c0477471>] ? printk+0x1d/0x24 [<c01c98f9>] ? perf_prepare_sample+0x269/0x280 [<c0149231>] warn_slowpath_common+0x71/0xd0 [<c01c98f9>] ? perf_prepare_sample+0x269/0x280 [<c01492aa>] warn_slowpath_null+0x1a/0x20 [<c01c98f9>] perf_prepare_sample+0x269/0x280 [<c016e9f3>] ? cpu_clock+0x53/0x90 [<c01cc368>] __perf_event_overflow+0x2a8/0x300 [<c01ccc3b>] perf_event_overflow+0x1b/0x30 [<c01ccccf>] perf_swevent_hrtimer+0x7f/0x120 This is because 'data.raw' variable not initialize. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B208E93.1010801@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-10 07:11:05 +01:00
Linus Torvalds	4ef58d4e2a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits) tree-wide: fix misspelling of "definition" in comments reiserfs: fix misspelling of "journaled" doc: Fix a typo in slub.txt. inotify: remove superfluous return code check hdlc: spelling fix in find_pvc() comment doc: fix regulator docs cut-and-pasteism mtd: Fix comment in Kconfig doc: Fix IRQ chip docs tree-wide: fix assorted typos all over the place drivers/ata/libata-sff.c: comment spelling fixes fix typos/grammos in Documentation/edac.txt sysctl: add missing comments fs/debugfs/inode.c: fix comment typos sgivwfb: Make use of ARRAY_SIZE. sky2: fix sky2_link_down copy/paste comment error tree-wide: fix typos "couter" -> "counter" tree-wide: fix typos "offest" -> "offset" fix kerneldoc for set_irq_msi() spidev: fix double "of of" in comment comment typo fix: sybsystem -> subsystem ...	2009-12-09 19:43:33 -08:00
Thomas Gleixner	86fc80f16e	capabilities: Use RCU to protect task lookup in sys_capget cap_get_target_pid() protects the task lookup with tasklist_lock. security_capget() is called under tasklist_lock as well but tasklist_lock does not protect anything there. The capabilities are protected by RCU already. So tasklist_lock only protects the lookup and prevents the task going away, which can be done with rcu_read_lock() as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: James Morris <jmorris@namei.org>	2009-12-10 09:42:48 +11:00
Carsten Emde	f2942487ff	tracing: Remove comparing of NULL to va_list in trace_array_vprintk() Olof Johansson stated the following: Comparing a va_list with NULL is bogus. It's supposed to be treated like an opaque type and only be manipulated with va_* accessors. Olof noticed that this code broke the ARM builds: kernel/trace/trace.c: In function 'trace_array_vprintk': kernel/trace/trace.c:1364: error: invalid operands to binary == (have 'va_list' and 'void *') kernel/trace/trace.c: In function 'tracing_mark_write': kernel/trace/trace.c:3349: error: incompatible type for argument 3 of 'trace_vprintk' This patch partly reverts `c13d2f7c32` and re-installs the original mark_printk() mechanism. Reported-by: Olof Johansson <olof@lixom.net> Signed-off-by: Carsten Emde <C.Emde@osadl.org> LKML-Reference: <4B1BAB74.104@osadl.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-09 14:20:08 -05:00
Jiri Olsa	be1eca3931	tracing: Fix function graph trace_pipe to properly display failed entries There is a case where the graph tracer might get confused and omits displaying of a single record. This applies mostly with the trace_pipe since it is unlikely that the trace_seq buffer will overflow with the trace file. As the function_graph tracer goes through the trace entries keeping a pointer to the current record: current -> func1 ENTRY func2 ENTRY func2 RETURN func1 RETURN When an function ENTRY is encountered, it moves the pointer to the next entry to check if the function is a nested or leaf function. func1 ENTRY current -> func2 ENTRY func2 RETURN func1 RETURN If the rest of the writing of the function fills the trace_seq buffer, then the trace_pipe read will ignore this entry. The next read will Now start at the current location, but the first entry (func1) will be discarded. This patch keeps a copy of the current entry in the iterator private storage and will keep track of when the trace_seq buffer fills. When the trace_seq buffer fills, it will reuse the copy of the entry in the next iteration. [ This patch has been largely modified by Steven Rostedt in order to clean it up and simplify it. The original idea and concept was from Jirka and for that, this patch will go under his name to give him the credit he deserves. But because this was modify by Steven Rostedt anything wrong with the patch should be blamed on Steven. ] Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1259067458-27143-1-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-09 14:09:06 -05:00
Johannes Berg	d184b31c0e	tracing: Add full state to trace_seq The trace_seq buffer might fill up, and right now one needs to check the return value of each printf into the buffer to check for that. Instead, have the buffer keep track of whether it is full or not, and reject more input if it is full or would have overflowed with an input that wasn't added. Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-09 14:05:49 -05:00
Steven Rostedt	a63ce5b306	tracing: Buffer the output of seq_file in case of filled buffer If the seq_read fills the buffer it will call s_start again on the next itertation with the same position. This causes a problem with the function_graph tracer because it consumes the iteration in order to determine leaf functions. What happens is that the iterator stores the entry, and the function graph plugin will look at the next entry. If that next entry is a return of the same function and task, then the function is a leaf and the function_graph plugin calls ring_buffer_read which moves the ring buffer iterator forward (the trace iterator still points to the function start entry). The copying of the trace_seq to the seq_file buffer will fail if the seq_file buffer is full. The seq_read will not show this entry. The next read by userspace will cause seq_read to again call s_start which will reuse the trace iterator entry (the function start entry). But the function return entry was already consumed. The function graph plugin will think that this entry is a nested function and not a leaf. To solve this, the trace code now checks the return status of the seq_printf (trace_print_seq). If the writing to the seq_file buffer fails, we set a flag in the iterator (leftover) and we do not reset the trace_seq buffer. On the next call to s_start, we check the leftover flag, and if it is set, we just reuse the trace_seq buffer and do not call into the plugin print functions. Before this patch: 2) \| fput() { 2) \| __fput() { 2) 0.550 us \| inotify_inode_queue_event(); 2) \| __fsnotify_parent() { 2) 0.540 us \| inotify_dentry_parent_queue_event(); After the patch: 2) \| fput() { 2) \| __fput() { 2) 0.550 us \| inotify_inode_queue_event(); 2) 0.548 us \| __fsnotify_parent(); 2) 0.540 us \| inotify_dentry_parent_queue_event(); [ Updated the patch to fix a missing return 0 from the trace_print_seq() stub when CONFIG_TRACING is disabled. Reported-by: Ingo Molnar <mingo@elte.hu> ] Reported-by: Jiri Olsa <jolsa@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-09 13:55:26 -05:00
Steven Rostedt	29bf4a5e3f	tracing: Only call pipe_close if pipe_close is defined This fixes a cut and paste error that had pipe_close get called if pipe_open was defined (not pipe_close). Reported-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> LKML-Reference: <20091209153204.F4CD.A69D9226@jp.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-09 12:47:35 -05:00
Linus Torvalds	3b8ecd2244	Merge branch 'bkl-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'bkl-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sys: Remove BKL from sys_reboot pm_qos: clean up racy global "name" variable pm_qos: remove BKL	2009-12-09 08:07:17 -08:00
Thomas Gleixner	f409adf5b1	futex: Protect pid lookup in compat code with RCU find_task_by_vpid() in compat_sys_get_robust_list() does not require tasklist_lock. It can be protected with rcu_read_lock as done in sys_get_robust_list() already. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org>	2009-12-09 14:22:14 +01:00
Frederic Weisbecker	822a696111	tracing/kprobes: Fix field creation's bad error handling When we define the common event fields in kprobe, we invert the error handling and return immediately in case of success. Then we omit to define specific kprobes fields (ip and nargs), and specific kretprobes fields (func, ret_ip, nargs). And we only define them when we fail to create common fields. The most visible consequence is that we can't create filter for k(ret)probes specific fields. This patch re-invert the success/error handling to fix it. Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <1260263815-5167-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:32:21 +01:00
Christian Ehrhardt	acb4a848da	sched: Update normalized values on user updates via proc The normalized values are also recalculated in case the scaling factor changes. This patch updates the internally used scheduler tuning values that are normalized to one cpu in case a user sets new values via sysfs. Together with patch 2 of this series this allows to let user configured values scale (or not) to cpu add/remove events taking place later. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1259579808-11357-4-git-send-email-ehrhardt@linux.vnet.ibm.com> [ v2: fix warning ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:04:02 +01:00
Christian Ehrhardt	1983a922a1	sched: Make tunable scaling style configurable As scaling now takes place on all kind of cpu add/remove events a user that configures values via proc should be able to configure if his set values are still rescaled or kept whatever happens. As the comments state that log2 was just a second guess that worked the interface is not just designed for on/off, but to choose a scaling type. Currently this allows none, log and linear, but more important it allwos us to keep the interface even if someone has an even better idea how to scale the values. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1259579808-11357-3-git-send-email-ehrhardt@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:04:01 +01:00
Christian Ehrhardt	0bcdcf28c9	sched: Fix missing sched tunable recalculation on cpu add/remove Based on Peter Zijlstras patch suggestion this enables recalculation of the scheduler tunables in response of a change in the number of cpus. It also adds a max of eight cpus that are considered in that scaling. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1259579808-11357-2-git-send-email-ehrhardt@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:58 +01:00
Peter Zijlstra	57785df5ac	sched: Fix task priority bug 83f9ac removed a call to effective_prio() in wake_up_new_task(), which leads to tasks running at MAX_PRIO. This is caused by the idle thread being set to MAX_PRIO before forking off init. O(1) used that to make sure idle was always preempted, CFS uses check_preempt_curr_idle() for that so we can savely remove this bit of legacy code. Reported-by: Mike Galbraith <efault@gmx.de> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1259754383.4003.610.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:10 +01:00
Peter Zijlstra	cd8ad40de3	sched: cgroup: Implement different treatment for idle shares When setting the weight for a per-cpu task-group, we have to put in a phantom weight when there is no work on that cpu, otherwise we'll not service that cpu when new work gets placed there until we again update the per-cpu weights. We used to add these phantom weights to the total, so that the idle per-cpu shares don't get inflated, this however causes the non-idle parts to get deflated, causing unexpected weight distibutions. Reverse this, so that the non-idle shares are correct but the idle shares are inflated. Reported-by: Yasunori Goto <y-goto@jp.fujitsu.com> Tested-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1257934048.23203.76.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:09 +01:00
Peter Zijlstra	fb58bac5c7	sched: Remove unnecessary RCU exclusion As Nick pointed out, and realized by myself when doing: sched: Fix balance vs hotplug race the patch: sched: for_each_domain() vs RCU is wrong, sched_domains are freed after synchronize_sched(), which means disabling preemption is enough. Reported-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:08 +01:00
Peter Zijlstra	6cecd084d0	sched: Discard some old bits WAKEUP_RUNNING was an experiment, not sure why that ever ended up being merged... Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:07 +01:00
Peter Zijlstra	3a7e73a2e2	sched: Clean up check_preempt_wakeup() Streamline the wakeup preemption code a bit, unifying the preempt path so that they all do the same. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:07 +01:00
Jupyung Lee	a65ac745e4	sched: Move update_curr() in check_preempt_wakeup() to avoid redundant call If a RT task is woken up while a non-RT task is running, check_preempt_wakeup() is called to check whether the new task can preempt the old task. The function returns quickly without going deeper because it is apparent that a RT task can always preempt a non-RT task. In this situation, check_preempt_wakeup() always calls update_curr() to update vruntime value of the currently running task. However, the function call is unnecessary and redundant at that moment because (1) a non-RT task can always be preempted by a RT task regardless of its vruntime value, and (2) update_curr() will be called shortly when the context switch between two occurs. By moving update_curr() in check_preempt_wakeup(), we can avoid redundant call to update_curr(), slightly reducing the time taken to wake up RT tasks. Signed-off-by: Jupyung Lee <jupyung@gmail.com> [ Place update_curr() right before the wake_preempt_entity() call, which is the only thing that relies on the updated vruntime ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1258451500-6714-1-git-send-email-jupyung@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:06 +01:00
Peter Zijlstra	cd29fe6f26	sched: Sanitize fork() handling Currently we try to do task placement in wake_up_new_task() after we do the load-balance pass in sched_fork(). This yields complicated semantics in that we have to deal with tasks on different RQs and the set_task_cpu() calls in copy_process() and sched_fork() Rename ->task_new() to ->task_fork() and call it from sched_fork() before the balancing, this gives the policy a clear point to place the task. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:05 +01:00
Peter Zijlstra	ab19cb2331	sched: Clean up ttwu() rq locking Since set_task_clock() doesn't rely on rq->clock anymore we can simplyfy the mess in ttwu(). Optimize things a bit by not fiddling with the IRQ state there. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:04 +01:00
Peter Zijlstra	5afcdab706	sched: Remove rq->clock coupling from set_task_cpu() set_task_cpu() should be rq invariant and only touch task state, it currently fails to do so, which opens up a few races, since not all callers hold both rq->locks. Remove the relyance on rq->clock, as any site calling set_task_cpu() should also do a remote clock update, which should ensure the observed time between these two cpus is monotonic, as per kernel/sched_clock.c:sched_clock_remote(). Therefore we can simply remove the clock_offset bits and be happy. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:03 +01:00
Peter Zijlstra	970b13bacb	sched: Consolidate select_task_rq() callers Small cleanup. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> [ v2: build fix ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:02 +01:00
Peter Zijlstra	6b314d0e11	sched: Remove sysctl.sched_features Since we've had a much saner debugfs interface to this, remove the sysctl one. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> [ v2: build fix ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:01 +01:00
Thomas Gleixner	dba091b9e3	sched: Protect sched_rr_get_param() access to task->sched_class sched_rr_get_param calls task->sched_class->get_rr_interval(task) without protection against a concurrent sched_setscheduler() call which modifies task->sched_class. Serialize the access with task_rq_lock(task) and hand the rq pointer into get_rr_interval() as it's needed at least in the sched_fair implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <alpine.LFD.2.00.0912090930120.3089@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:01:07 +01:00
Thomas Gleixner	3160568371	sched: Protect task->cpus_allowed access in sched_getaffinity() sched_getaffinity() is not protected against a concurrent modification of the tasks affinity. Serialize the access with task_rq_lock(task). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091208202026.769251187@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:01:06 +01:00
Xiao Guangrong	ec89a06fd4	perf_event: Cleanup for cpu_clock_perf_event_update() Using atomic64_xchg() instead of atomic64_read() and atomic64_set(). Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B1F19DC.90204@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 09:56:27 +01:00
Xiao Guangrong	b93f7978ad	perf_event: Allocate children's perf_event_ctxp at the right time In current code, children task will allocate memory for 'child->perf_event_ctxp' if the parent is counted, we can do it only if the parent allowed children inherit it. It can save memory and reduce overhead. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B1F19A8.5040805@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 09:56:27 +01:00
Xiao Guangrong	aa5452d70c	perf_event: Clean up __perf_event_init_context() Clean up the code a bit: - define 'perf_cpu_context' variable with 'static' - use kzalloc() instead of kmalloc() and memset() Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B1F194D.7080306@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 09:56:27 +01:00
Frederic Weisbecker	44234adcdc	hw-breakpoints: Modify breakpoints without unregistering them Currently, when ptrace needs to modify a breakpoint, like disabling it, changing its address, type or len, it calls modify_user_hw_breakpoint(). This latter will perform the heavy and racy task of unregistering the old breakpoint and registering a new one. This is racy as someone else might steal the reserved breakpoint slot under us, which is undesired as the breakpoint is only supposed to be modified, sometimes in the middle of a debugging workflow. We don't want our slot to be stolen in the middle. So instead of unregistering/registering the breakpoint, just disable it while we modify its breakpoint fields and re-enable it after if necessary. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1260347148-5519-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 09:48:20 +01:00
Masami Hiramatsu	a7c312bed7	trace-kprobe: Support delete probe syntax Support delete probe syntax. The syntax is "-:[group/]event". Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> LKML-Reference: <20091208220316.10142.39192.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>	2009-12-09 07:26:53 +01:00
Linus Torvalds	2b876f95d0	Merge branches 'timers-for-linus-ntp' and 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-ntp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ntp: Provide compability defines (You say MOD_NANO, I say ADJ_NANO) * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: do not execute DEBUG_SHIRQ when irq setup failed	2009-12-08 19:30:19 -08:00
Linus Torvalds	fbf07eac7b	Merge branch 'timers-for-linus-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer: Fix /proc/timer_list regression itimers: Fix racy writes to cpu_itimer fields timekeeping: Fix clock_gettime vsyscall time warp	2009-12-08 19:28:09 -08:00
Linus Torvalds	60d8ce2cd6	Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timers, init: Limit the number of per cpu calibration bootup messages posix-cpu-timers: optimize and document timer_create callback clockevents: Add missing include to pacify sparse x86: vmiclock: Fix printk format x86: Fix printk format due to variable type change sparc: fix printk for change of variable type clocksource/events: Fix fallout of generic code changes nohz: Allow 32-bit machines to sleep for more than 2.15 seconds nohz: Track last do_timer() cpu nohz: Prevent clocksource wrapping during idle nohz: Type cast printk argument mips: Use generic mult/shift factor calculation for clocks clocksource: Provide a generic mult/shift factor calculation clockevents: Use u32 for mult and shift factors nohz: Introduce arch_needs_cpu nohz: Reuse ktime in sub-functions of tick_check_idle. time: Remove xtime_cache time: Implement logarithmic time accumulation	2009-12-08 19:27:08 -08:00
Linus Torvalds	4d2a914239	Merge branch 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: core: Clean up user return notifers use of per_cpu	2009-12-08 13:24:20 -08:00
Linus Torvalds	6035ccd8e9	Merge branch 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits) cfq-iosched: Do not access cfqq after freeing it block: include linux/err.h to use ERR_PTR cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit blkio: Allow CFQ group IO scheduling even when CFQ is a module blkio: Implement dynamic io controlling policy registration blkio: Export some symbols from blkio as its user CFQ can be a module block: Fix io_context leak after failure of clone with CLONE_IO block: Fix io_context leak after clone with CLONE_IO cfq-iosched: make nonrot check logic consistent io controller: quick fix for blk-cgroup and modular CFQ cfq-iosched: move IO controller declerations to a header file cfq-iosched: fix compile problem with !CONFIG_CGROUP blkio: Documentation blkio: Wait on sync-noidle queue even if rq_noidle = 1 blkio: Implement group_isolation tunable blkio: Determine async workload length based on total number of queues blkio: Wait for cfq queue to get backlogged if group is empty blkio: Propagate cgroup weight updation to cfq groups blkio: Drop the reference to queue once the task changes cgroup blkio: Provide some isolation between groups ...	2009-12-08 08:19:16 -08:00
Linus Torvalds	dad3de7d00	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: Add flag for devices capable of generating run-time wake-up events PM / Runtime: Remove unnecessary braces in __pm_runtime_set_status() PM / Runtime: Make documentation of runtime_idle() agree with the code PM / Runtime: Ensure timer_expires is nonzero in pm_schedule_suspend() PM / Runtime: Use deferred_resume flag in pm_request_resume PM / Runtime: Export the PM runtime workqueue PM / Runtime: Fix lockdep warning in __pm_runtime_set_status() PM / Hibernate: Swap, use KERN_CONT PM / Hibernate: Shift remaining code from swsusp.c to hibernate.c PM / Hibernate: Move swap functions to kernel/power/swap.c. PM / freezer: Don't get over-anxious while waiting	2009-12-08 08:07:16 -08:00
Linus Torvalds	ed9216c171	Merge branch 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (84 commits) KVM: VMX: Fix comparison of guest efer with stale host value KVM: s390: Fix prefix register checking in arch/s390/kvm/sigp.c KVM: Drop user return notifier when disabling virtualization on a cpu KVM: VMX: Disable unrestricted guest when EPT disabled KVM: x86 emulator: limit instructions to 15 bytes KVM: s390: Make psw available on all exits, not just a subset KVM: x86: Add KVM_GET/SET_VCPU_EVENTS KVM: VMX: Report unexpected simultaneous exceptions as internal errors KVM: Allow internal errors reported to userspace to carry extra data KVM: Reorder IOCTLs in main kvm.h KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG KVM: only clear irq_source_id if irqchip is present KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic KVM: x86: disallow multiple KVM_CREATE_IRQCHIP KVM: VMX: Remove vmx->msr_offset_efer KVM: MMU: update invlpg handler comment KVM: VMX: move CR3/PDPTR update to vmx_set_cr3 KVM: remove duplicated task_switch check KVM: powerpc: Fix BUILD_BUG_ON condition KVM: VMX: Use shared msr infrastructure ... Trivial conflicts due to new Kconfig options in arch/Kconfig and kernel/Makefile	2009-12-08 08:02:38 -08:00
Linus Torvalds	d7fc02c7ba	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits) mac80211: fix reorder buffer release iwmc3200wifi: Enable wimax core through module parameter iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter iwmc3200wifi: Coex table command does not expect a response iwmc3200wifi: Update wiwi priority table iwlwifi: driver version track kernel version iwlwifi: indicate uCode type when fail dump error/event log iwl3945: remove duplicated event logging code b43: fix two warnings ipw2100: fix rebooting hang with driver loaded cfg80211: indent regulatory messages with spaces iwmc3200wifi: fix NULL pointer dereference in pmkid update mac80211: Fix TX status reporting for injected data frames ath9k: enable 2GHz band only if the device supports it airo: Fix integer overflow warning rt2x00: Fix padding bug on L2PAD devices. WE: Fix set events not propagated b43legacy: avoid PPC fault during resume b43: avoid PPC fault during resume tcp: fix a timewait refcnt race ... Fix up conflicts due to sysctl cleanups (dead sysctl_check code and CTL_UNNUMBERED removed) in kernel/sysctl_check.c net/ipv4/sysctl_net_ipv4.c net/ipv6/addrconf.c net/sctp/sysctl.c	2009-12-08 07:55:01 -08:00
Linus Torvalds	1557d33007	Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits) security/tomoyo: Remove now unnecessary handling of security_sysctl. security/tomoyo: Add a special case to handle accesses through the internal proc mount. sysctl: Drop & in front of every proc_handler. sysctl: Remove CTL_NONE and CTL_UNNUMBERED sysctl: kill dead ctl_handler definitions. sysctl: Remove the last of the generic binary sysctl support sysctl net: Remove unused binary sysctl code sysctl security/tomoyo: Don't look at ctl_name sysctl arm: Remove binary sysctl support sysctl x86: Remove dead binary sysctl support sysctl sh: Remove dead binary sysctl support sysctl powerpc: Remove dead binary sysctl support sysctl ia64: Remove dead binary sysctl support sysctl s390: Remove dead sysctl binary support sysctl frv: Remove dead binary sysctl support sysctl mips/lasat: Remove dead binary sysctl support sysctl drivers: Remove dead binary sysctl support sysctl crypto: Remove dead binary sysctl support sysctl security/keys: Remove dead binary sysctl support sysctl kernel: Remove binary sysctl logic ...	2009-12-08 07:38:50 -08:00
Andi Kleen	722d017237	futex: Take mmap_sem for get_user_pages in fault_in_user_writeable get_user_pages() must be called with mmap_sem held. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: stable@kernel.org Cc: Andrew Morton <akpm@linuxfoundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091208121942.GA21298@basil.fritz.box> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-12-08 14:59:36 +01:00
Stephen Rothwell	6ab8886326	perf: hw_breakpoints: Fix percpu namespace clash Today's linux-next build failed with: kernel/hw_breakpoint.c:86: error: 'task_bp_pinned' redeclared as different kind of symbol ... Caused by commit `dd17c8f729` ("percpu: remove per_cpu__ prefix") from the percpu tree interacting with commit `56053170ea` ("hw-breakpoints: Fix task-bound breakpoint slot allocation") from the tip tree. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20091208182515.bb6dda4a.sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-08 09:34:43 +01:00
Tejun Heo	50de1a8ef1	Merge branch 'for-linus' into for-next Conflicts: mm/percpu.c	2009-12-08 10:02:12 +09:00
Jiri Kosina	d014d04386	Merge branch 'for-next' into for-linus Conflicts: kernel/irq/chip.c	2009-12-07 18:36:35 +01:00
Steven Rostedt	c521efd170	tracing: Add pipe_close interface An ftrace plugin can add a pipe_open interface when the user opens trace_pipe. But if the plugin allocates something within the pipe_open it can not free it because there exists no pipe_close. The hook to the trace file open has a corresponding close. The closing of the trace_pipe file should also have a corresponding close. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-07 12:01:35 -05:00
Frederic Weisbecker	56053170ea	hw-breakpoints: Fix task-bound breakpoint slot allocation Whatever the context nature of a breakpoint, we always perform the following constraint checks before allocating it a slot: - Check the number of pinned breakpoint bound the concerned cpus - Check the max number of task-bound breakpoints that are belonging to a task. - Add both and see if we have a reamining slot for the new breakpoint This is the right thing to do when we are about to register a cpu-only bound breakpoint. But not if we are dealing with a task bound breakpoint. What we want in this case is: - Check the number of pinned breakpoint bound the concerned cpus - Check the number of breakpoints that already belong to the task in which the breakpoint to register is bound to. - Add both This fixes a regression that makes the "firefox -g" command fail to register breakpoints once we deal with a secondary thread. Reported-by: Walt <w41ter@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Prasad <prasad@linux.vnet.ibm.com>	2009-12-07 07:05:28 +01:00
Peter Zijlstra	6ad4c18884	sched: Fix balance vs hotplug race Since (`e761b77`: cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment) we have cpu_active_mask which is suppose to rule scheduler migration and load-balancing, except it never (fully) did. The particular problem being solved here is a crash in try_to_wake_up() where select_task_rq() ends up selecting an offline cpu because select_task_rq_fair() trusts the sched_domain tree to reflect the current state of affairs, similarly select_task_rq_rt() trusts the root_domain. However, the sched_domains are updated from CPU_DEAD, which is after the cpu is taken offline and after stop_machine is done. Therefore it can race perfectly well with code assuming the domains are right. Cure this by building the domains from cpu_active_mask on CPU_DOWN_PREPARE. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-06 21:10:56 +01:00
Geert Uytterhoeven	e1b8090bdf	cpumask: Fix generate_sched_domains() for UP Commit `acc3f5d7ca` ("cpumask: Partition_sched_domains takes array of cpumask_var_t") changed the function signature of generate_sched_domains() for the CONFIG_SMP=y case, but forgot to update the corresponding function for the CONFIG_SMP=n case, causing: kernel/cpuset.c:2073: warning: passing argument 1 of 'generate_sched_domains' from incompatible pointer type Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <alpine.DEB.2.00.0912062038070.5693@ayla.of.borg> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-06 21:08:41 +01:00
Alan Stern	7b199ca202	PM / Runtime: Export the PM runtime workqueue This patch (as1306) exports the PM runtime workqueue for use by loadable modules. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-12-06 16:17:56 +01:00
Jiri Slaby	66d0ae4d6f	PM / Hibernate: Swap, use KERN_CONT Use KERN_CONT in save_image() for printks, so that anybody won't try to add a loglevel. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-12-06 16:16:24 +01:00
Nigel Cunningham	8e60c6a134	PM / Hibernate: Shift remaining code from swsusp.c to hibernate.c Shift the remaining declaration of the variable in_suspend and the function swsusp_show_speed from swsusp.c to hibernate.c, and delete swsusp.c. Signed-off-by: Nigel Cunningham <nigel@tuxonice.net> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-12-06 16:16:07 +01:00
Nigel Cunningham	0414f2ec03	PM / Hibernate: Move swap functions to kernel/power/swap.c. Move hibernation code's functions for allocating and freeing swap from swsusp.c to swap.c, which is where you'd expect to find them. Signed-off-by: Nigel Cunningham <nigel@tuxonice.net> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-12-06 16:15:53 +01:00
Rafael J. Wysocki	64357ed468	Merge branch 'master' into for-linus	2009-12-06 16:06:11 +01:00
Frank Rowand	109d71c6dd	lockstat: Fix min, max times in /proc/lock_stats Fix min, max times in /proc/lock_stats (1) When collecting lock hold and wait times, if the current minimum time is zero, it will be replaced by the next time. (2) When aggregating minimum and maximum lock hold and wait times accross cpus, the values are added, instead of selecting the minimum and maximum. Signed-off-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4B05BBAE.2050005@am.sony.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-06 13:20:00 +01:00
Frederic Weisbecker	b326e9560a	hw-breakpoints: Use overflow handler instead of the event callback struct perf_event::event callback was called when a breakpoint triggers. But this is a rather opaque callback, pretty tied-only to the breakpoint API and not really integrated into perf as it triggers even when we don't overflow. We prefer to use overflow_handler() as it fits into the perf events rules, being called only when we overflow. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>	2009-12-06 08:27:18 +01:00
Frederic Weisbecker	2f0993e0fb	hw-breakpoints: Drop callback and task parameters from modify helper Drop the callback and task parameters from modify_user_hw_breakpoint(). For now we have no user that need to modify a breakpoint to the point of changing its handler or its task context. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>	2009-12-06 08:27:17 +01:00
Linus Torvalds	897e81bea1	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits) sched, cputime: Introduce thread_group_times() sched, cputime: Cleanups related to task_times() Revert "sched, x86: Optimize branch hint in __switch_to()" sched: Fix isolcpus boot option sched: Revert `498657a478` sched, time: Define nsecs_to_jiffies() sched: Remove task_{u,s,g}time() sched: Introduce task_times() to replace task_{u,s}time() pair sched: Limit the number of scheduler debug messages sched.c: Call debug_show_all_locks() when dumping all tasks sched, x86: Optimize branch hint in __switch_to() sched: Optimize branch hint in context_switch() sched: Optimize branch hint in pick_next_task_fair() sched_feat_write(): Update ppos instead of file->f_pos sched: Sched_rt_periodic_timer vs cpu hotplug sched, kvm: Fix race condition involving sched_in_preempt_notifers sched: More generic WAKE_AFFINE vs select_idle_sibling() sched: Cleanup select_task_rq_fair() sched: Fix granularity of task_u/stime() sched: Fix/add missing update_rq_clock() calls ...	2009-12-05 15:30:49 -08:00
Linus Torvalds	c3fa27d136	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (470 commits) x86: Fix comments of register/stack access functions perf tools: Replace %m with %a in sscanf hw-breakpoints: Keep track of user disabled breakpoints tracing/syscalls: Make syscall events print callbacks static tracing: Add DEFINE_EVENT(), DEFINE_SINGLE_EVENT() support to docbook perf: Don't free perf_mmap_data until work has been done perf_event: Fix compile error perf tools: Fix _GNU_SOURCE macro related strndup() build error trace_syscalls: Remove unused syscall_name_to_nr() trace_syscalls: Simplify syscall profile trace_syscalls: Remove duplicate init_enter_##sname() trace_syscalls: Add syscall_nr field to struct syscall_metadata trace_syscalls: Remove enter_id exit_id trace_syscalls: Set event_enter_##sname->data to its metadata trace_syscalls: Remove unused event_syscall_enter and event_syscall_exit perf_event: Initialize data.period in perf_swevent_hrtimer() perf probe: Simplify event naming perf probe: Add --list option for listing current probe events perf probe: Add argv_split() from lib/argv_split.c perf probe: Move probe event utility functions to probe-event.c ...	2009-12-05 15:30:21 -08:00
David S. Miller	28b4d5cc17	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ Conflicts: drivers/net/pcmcia/fmvj18x_cs.c drivers/net/pcmcia/nmclan_cs.c drivers/net/pcmcia/xirc2ps_cs.c drivers/net/wireless/ray_cs.c	2009-12-05 15:22:26 -08:00
Linus Torvalds	96fa2b508d	Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (40 commits) tracing: Separate raw syscall from syscall tracer ring-buffer-benchmark: Add parameters to set produce/consumer priorities tracing, function tracer: Clean up strstrip() usage ring-buffer benchmark: Run producer/consumer threads at nice +19 tracing: Remove the stale include/trace/power.h tracing: Only print objcopy version warning once from recordmcount tracing: Prevent build warning: 'ftrace_graph_buf' defined but not used ring-buffer: Move access to commit_page up into function used tracing: do not disable interrupts for trace_clock_local ring-buffer: Add multiple iterations between benchmark timestamps kprobes: Sanitize struct kretprobe_instance allocations tracing: Fix to use __always_unused attribute compiler: Introduce __always_unused tracing: Exit with error if a weak function is used in recordmcount.pl tracing: Move conditional into update_funcs() in recordmcount.pl tracing: Add regex for weak functions in recordmcount.pl tracing: Move mcount section search to front of loop in recordmcount.pl tracing: Fix objcopy revision check in recordmcount.pl tracing: Check absolute path of input file in recordmcount.pl tracing: Correct the check for number of arguments in recordmcount.pl ...	2009-12-05 09:53:36 -08:00
Linus Torvalds	7a797cdcca	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix trace_marker output tracing: Fix event format export tracing: Fix return value of tracing_stats_read()	2009-12-05 09:53:21 -08:00
Linus Torvalds	bb2166c898	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Fix spurious irq seqfile conversion genirq: switch /proc/irq/*/spurious to seq_file irq: Do not attempt to create subdirectories if /proc/irq/<irq> failed irq: Remove unused debug_poll_all_shared_irqs() irq: Fix docbook comments irq: trivial: Fix typo in comment for #endif	2009-12-05 09:53:08 -08:00
Linus Torvalds	0bf7969fea	Merge branch 'core-softlockup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-softlockup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: softlockup: Fix hung_task_check_count sysctl	2009-12-05 09:52:46 -08:00
Linus Torvalds	69f061e0c2	Merge branch 'core-signal-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-signal-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: signal: Print warning message when dropping signals signal: Fix alternate signal stack check	2009-12-05 09:52:33 -08:00
Linus Torvalds	607781762e	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits) rcu: Make RCU's CPU-stall detector be default rcu: Add expedited grace-period support for preemptible RCU rcu: Enable fourth level of TREE_RCU hierarchy rcu: Rename "quiet" functions rcu: Re-arrange code to reduce #ifdef pain rcu: Eliminate unneeded function wrapping rcu: Fix grace-period-stall bug on large systems with CPU hotplug rcu: Eliminate __rcu_pending() false positives rcu: Further cleanups of use of lastcomp rcu: Simplify association of forced quiescent states with grace periods rcu: Accelerate callback processing on CPUs not detecting GP end rcu: Mark init-time-only rcu_bootup_announce() as __init rcu: Simplify association of quiescent states with grace periods rcu: Rename dynticks_completed to completed_fqs rcu: Enable synchronize_sched_expedited() fastpath rcu: Remove inline from forward-referenced functions rcu: Fix note_new_gpnum() uses of ->gpnum rcu: Fix synchronization for rcu_process_gp_end() uses of ->completed counter rcu: Prepare for synchronization fixes: clean up for non-NO_HZ handling of ->completed counter rcu: Cleanup: balance rcu_irq_enter()/rcu_irq_exit() calls ...	2009-12-05 09:52:14 -08:00
Linus Torvalds	d0b093a8b5	Merge branch 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ratelimit: Make suppressed output messages more useful printk: Remove ratelimit.h from kernel.h ratelimit: Fix/allow use in atomic contexts ratelimit: Use per ratelimit context locking	2009-12-05 09:50:22 -08:00
Linus Torvalds	3e72b810e3	Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: mutex: Fix missing conditions to build mutex_spin_on_owner() mutex: Better control mutex adaptive spinning config locking, task_struct: Reduce size on TRACE_IRQFLAGS and 64bit locking: Use __[SPIN\|RW]_LOCK_UNLOCKED in [spin\|rw]_lock_init() locking: Remove unused prototype locking: Reduce ifdefs in kernel/spinlock.c locking: Make inlining decision Kconfig based	2009-12-05 09:49:59 -08:00
Linus Torvalds	9b269d4034	Merge branch 'core-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: generic-ipi: Add smp_call_function_any() generic-ipi: Fix misleading smp_call_function*() description	2009-12-05 09:49:46 -08:00
Louis Rilling	b69f229206	block: Fix io_context leak after failure of clone with CLONE_IO With CLONE_IO, parent's io_context->nr_tasks is incremented, but never decremented whenever copy_process() fails afterwards, which prevents exit_io_context() from calling IO schedulers exit functions. Give a task_struct to exit_io_context(), and call exit_io_context() instead of put_io_context() in copy_process() cleanup path. Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 16:36:18 +01:00
André Goddard Rosa	af901ca181	tree-wide: fix assorted typos all over the place That is "success", "unknown", "through", "performance", "[re\|un]mapping" , "access", "default", "reasonable", "[con]currently", "temperature" , "channel", "[un]used", "application", "example","hierarchy", "therefore" , "[over\|under]flow", "contiguous", "threshold", "enough" and others. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-12-04 15:39:55 +01:00
Uwe Kleine-König	fbfecd3712	tree-wide: fix typos "couter" -> "counter" This patch was generated by git grep -E -i -l 'couter' \| xargs -r perl -p -i -e 's/couter/counter/' Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-12-04 15:39:51 +01:00
Liuweni	fb3d38b990	fix kerneldoc for set_irq_msi() Signed-off-by: Liuweni <qingshenlwy@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-12-04 15:39:50 +01:00
Patrick McHardy	8153a10c08	ipv4 05/05: add sysctl to accept packets with local source addresses commit 8ec1e0ebe26087bfc5c0394ada5feb5758014fc8 Author: Patrick McHardy <kaber@trash.net> Date: Thu Dec 3 12:16:35 2009 +0100 ipv4: add sysctl to accept packets with local source addresses Change fib_validate_source() to accept packets with a local source address when the "accept_local" sysctl is set for the incoming inet device. Combined with the previous patches, this allows to communicate between multiple local interfaces over the wire. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-03 12:14:38 -08:00
Frederic Weisbecker	c08f782985	mutex: Fix missing conditions to build mutex_spin_on_owner() We don't need to build mutex_spin_on_owner() if we have CONFIG_DEBUG_MUTEXES or CONFIG_HAVE_DEFAULT_NO_SPIN_MUTEXES as it won't be used under such configs. Use CONFIG_MUTEX_SPIN_ON_OWNER as it gathers all the necessary checks before building it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1259783357-8542-2-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org>	2009-12-03 11:50:11 +01:00
Frederic Weisbecker	c02260277e	mutex: Better control mutex adaptive spinning config Introduce CONFIG_MUTEX_SPIN_ON_OWNER so that we can centralize in a single place the conditions that determine its definition and use. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1259783357-8542-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org>	2009-12-03 11:50:11 +01:00
Paul E. McKenney	d9a3da0699	rcu: Add expedited grace-period support for preemptible RCU Implement an synchronize_rcu_expedited() for preemptible RCU that actually is expedited. This uses synchronize_sched_expedited() to force all threads currently running in a preemptible-RCU read-side critical section onto the appropriate ->blocked_tasks[] list, then takes a snapshot of all of these lists and waits for them to drain. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1259784616158-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-03 11:35:25 +01:00
Paul E. McKenney	cf244dc01b	rcu: Enable fourth level of TREE_RCU hierarchy Enable a fourth level of rcu_node hierarchy for TREE_RCU and TREE_PREEMPT_RCU. This is for stress-testing and experiemental purposes only, although in theory this would enable 16,777,216 CPUs on 64-bit systems, though only 1,048,576 CPUs on 32-bit systems. Normal experimental use of this fourth level will normally set CONFIG_RCU_FANOUT=2, requiring a 16-CPU system, though the more adventurous (and more fortunate) experimenters may wish to chose CONFIG_RCU_FANOUT=3 for 81-CPU systems or even CONFIG_RCU_FANOUT=4 for 256-CPU systems. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12597846161257-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-03 11:34:53 +01:00
Paul E. McKenney	d3f6bad391	rcu: Rename "quiet" functions The number of "quiet" functions has grown recently, and the names are no longer very descriptive. The point of all of these functions is to do some portion of the task of reporting a quiescent state, so rename them accordingly: o cpu_quiet() becomes rcu_report_qs_rdp(), which reports a quiescent state to the per-CPU rcu_data structure. If this turns out to be a new quiescent state for this grace period, then rcu_report_qs_rnp() will be invoked to propagate the quiescent state up the rcu_node hierarchy. o cpu_quiet_msk() becomes rcu_report_qs_rnp(), which reports a quiescent state for a given CPU (or possibly a set of CPUs) up the rcu_node hierarchy. o cpu_quiet_msk_finish() becomes rcu_report_qs_rsp(), which reports a full set of quiescent states to the global rcu_state structure. o task_quiet() becomes rcu_report_unblock_qs_rnp(), which reports a quiescent state due to a task exiting an RCU read-side critical section that had previously blocked in that same critical section. As indicated by the new name, this type of quiescent state is reported up the rcu_node hierarchy (using rcu_report_qs_rnp() to do so). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12597846163698-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-03 11:34:26 +01:00
Avi Kivity	58988b07cf	Merge remote branch 'tip/x86/entry' into kvm-updates/2.6.33 Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:30:06 +02:00
James Morris	c84d6efd36	Merge branch 'master' into next	2009-12-03 12:03:40 +05:30
Helge Deller	35dead4235	modules: don't export section names of empty sections via sysfs On the parisc architecture we face for each and every loaded kernel module this kernel "badness warning": sysfs: cannot create duplicate filename '/module/ac97_bus/sections/.text' Badness at fs/sysfs/dir.c:487 Reason for that is, that on parisc all kernel modules do have multiple .text sections due to the usage of the -ffunction-sections compiler flag which is needed to reach all jump targets on this platform. An objdump on such a kernel module gives: Sections: Idx Name Size VMA LMA File off Algn 0 .note.gnu.build-id 00000024 00000000 00000000 00000034 22 CONTENTS, ALLOC, LOAD, READONLY, DATA 1 .text 00000000 00000000 00000000 00000058 20 CONTENTS, ALLOC, LOAD, READONLY, CODE 2 .text.ac97_bus_match 0000001c 00000000 00000000 00000058 22 CONTENTS, ALLOC, LOAD, READONLY, CODE 3 .text 00000000 00000000 00000000 000000d4 20 CONTENTS, ALLOC, LOAD, READONLY, CODE ... Since the .text sections are empty (size of 0 bytes) and won't be loaded by the kernel module loader anyway, I don't see a reason why such sections need to be listed under /sys/module/<module_name>/sections/<section_name> either. The attached patch does solve this issue by not exporting section names which are empty. This fixes bugzilla http://bugzilla.kernel.org/show_bug.cgi?id=14703 Signed-off-by: Helge Deller <deller@gmx.de> CC: rusty@rustcorp.com.au CC: akpm@linux-foundation.org CC: James.Bottomley@HansenPartnership.com CC: roland@redhat.com CC: dave@hiauly1.hia.nrc.ca Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-02 15:38:25 -08:00
Hidetoshi Seto	0cf55e1ec0	sched, cputime: Introduce thread_group_times() This is a real fix for problem of utime/stime values decreasing described in the thread: http://lkml.org/lkml/2009/11/3/522 Now cputime is accounted in the following way: - {u,s}time in task_struct are increased every time when the thread is interrupted by a tick (timer interrupt). - When a thread exits, its {u,s}time are added to signal->{u,s}time, after adjusted by task_times(). - When all threads in a thread_group exits, accumulated {u,s}time (and also c{u,s}time) in signal struct are added to c{u,s}time in signal struct of the group's parent. So {u,s}time in task struct are "raw" tick count, while {u,s}time and c{u,s}time in signal struct are "adjusted" values. And accounted values are used by: - task_times(), to get cputime of a thread: This function returns adjusted values that originates from raw {u,s}time and scaled by sum_exec_runtime that accounted by CFS. - thread_group_cputime(), to get cputime of a thread group: This function returns sum of all {u,s}time of living threads in the group, plus {u,s}time in the signal struct that is sum of adjusted cputimes of all exited threads belonged to the group. The problem is the return value of thread_group_cputime(), because it is mixed sum of "raw" value and "adjusted" value: group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time) This misbehavior can break {u,s}time monotonicity. Assume that if there is a thread that have raw values greater than adjusted values (e.g. interrupted by 1000Hz ticks 50 times but only runs 45ms) and if it exits, cputime will decrease (e.g. -5ms). To fix this, we could do: group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time) But task_times() contains hard divisions, so applying it for every thread should be avoided. This patch fixes the above problem in the following way: - Modify thread's exit (= __exit_signal()) not to use task_times(). It means {u,s}time in signal struct accumulates raw values instead of adjusted values. As the result it makes thread_group_cputime() to return pure sum of "raw" values. - Introduce a new function thread_group_times(task, utime, *stime) that converts "raw" values of thread_group_cputime() to "adjusted" values, in same calculation procedure as task_times(). - Modify group's exit (= wait_task_zombie()) to use this introduced thread_group_times(). It make c{u,s}time in signal struct to have adjusted values like before this patch. - Replace some thread_group_cputime() by thread_group_times(). This replacements are only applied where conveys the "adjusted" cputime to users, and where already uses task_times() near by it. (i.e. sys_times(), getrusage(), and /proc/<PID>/stat.) This patch have a positive side effect: - Before this patch, if a group contains many short-life threads (e.g. runs 0.9ms and not interrupted by ticks), the group's cputime could be invisible since thread's cputime was accumulated after adjusted: imagine adjustment function as adj(ticks, runtime), {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0. After this patch it will not happen because the adjustment is applied after accumulated. v2: - remove if()s, put new variables into signal_struct. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Spencer Candland <spencer@bluehost.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> LKML-Reference: <4B162517.8040909@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-02 17:32:40 +01:00
Hidetoshi Seto	d99ca3b977	sched, cputime: Cleanups related to task_times() - Remove if({u,s}t)s because no one call it with NULL now. - Use cputime_{add,sub}(). - Add ifndef-endif for prev_{u,s}time since they are used only when !VIRT_CPU_ACCOUNTING. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Spencer Candland <spencer@bluehost.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> LKML-Reference: <4B1624C7.7040302@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-02 17:32:39 +01:00
Rusty Russell	bdddd2963c	sched: Fix isolcpus boot option Anton Blanchard wrote: > We allocate and zero cpu_isolated_map after the isolcpus > __setup option has run. This means cpu_isolated_map always > ends up empty and if CPUMASK_OFFSTACK is enabled we write to a > cpumask that hasn't been allocated. I introduced this regression in `49557e6203` (sched: Fix boot crash by zalloc()ing most of the cpu masks). Use the bootmem allocator if they set isolcpus=, otherwise allocate and zero like normal. Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: peterz@infradead.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@kernel.org> LKML-Reference: <200912021409.17013.rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Anton Blanchard <anton@samba.org>	2009-12-02 10:27:16 +01:00
Avi Kivity	1786bf009f	core: Clean up user return notifers use of per_cpu Instead of using per_cpu(..., raw_smp_processor_id()), use __get_cpu_var(...). Signed-off-by: Avi Kivity <avi@redhat.com> LKML-Reference: <1259578491-4589-1-git-send-email-avi@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-02 10:22:59 +01:00
Tejun Heo	8592e6486a	sched: Revert `498657a478` `498657a478` incorrectly assumed that preempt wasn't disabled around context_switch() and thus was fixing imaginary problem. It also broke KVM because it depended on ->sched_in() to be called with irq enabled so that it can do smp calls from there. Revert the incorrect commit and add comment describing different contexts under with the two callbacks are invoked. Avi: spotted transposed in/out in the added comment. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Avi Kivity <avi@redhat.com> Cc: peterz@infradead.org Cc: efault@gmx.de Cc: rusty@rustcorp.com.au LKML-Reference: <1259726212-30259-2-git-send-email-tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-02 09:55:33 +01:00
Randy Dunlap	595dd3d8bf	kmsg_dump: fix build for CONFIG_PRINTK=n kmsg_dump() fails to build when CONFIG_PRINTK=n; provide stubs for the kmsg_dump*() functions when CONFIG_PRINTK=n. kernel/printk.c: In function 'kmsg_dump': kernel/printk.c:1501: error: 'log_buf_len' undeclared (first use in this function) kernel/printk.c:1502: error: 'logged_chars' undeclared (first use in this function) kernel/printk.c:1506: error: 'log_buf' undeclared (first use in this function) Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Simon Kagstrom <simon.kagstrom@netinsight.net> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-12-02 08:44:33 +00:00
Kristian Høgsberg	ec70ccd806	perf: Don't free perf_mmap_data until work has been done In the CONFIG_PERF_USE_VMALLOC case, perf_mmap_data_free() only schedules the cleanup of the perf_mmap_data struct. In that case we have to wait until the work has been done before we free data. Signed-off-by: Kristian Høgsberg <krh@bitplanet.net> Cc: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1259697901-1747-1-git-send-email-krh@bitplanet.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-02 09:30:18 +01:00
David S. Miller	ff9c38bba3	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/mac80211/ht.c	2009-12-01 22:13:38 -08:00
Lai Jiangshan	7be077f563	trace_syscalls: Remove unused syscall_name_to_nr() After duplications are removed, syscall_name_to_nr() is unused. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D2A6.6060803@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:31 +01:00
Lai Jiangshan	3bbe84e9d3	trace_syscalls: Simplify syscall profile use only one prof_sysenter_enable() instead of prof_sysenter_enable_##sname() use only one prof_sysenter_disable() instead of prof_sysenter_disable_##sname() use only one prof_sysexit_enable() instead of prof_sysexit_enable_##sname() use only one prof_sysexit_disable() instead of prof_sysexit_disable_##sname() Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D2A1.8060304@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:30 +01:00
Lai Jiangshan	a1301da099	trace_syscalls: Remove duplicate init_enter_##sname() use only one init_syscall_trace instead of many init_enter_##sname()/init_exit_##sname() Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D29B.6090708@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:30 +01:00
Lai Jiangshan	c252f65793	trace_syscalls: Add syscall_nr field to struct syscall_metadata Add syscall_nr field to struct syscall_metadata, it helps us to get syscall number easier. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D293.6090800@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:29 +01:00
Lai Jiangshan	fcc19438dd	trace_syscalls: Remove enter_id exit_id use ->enter_event->id instead of ->enter_id use ->exit_event->id instead of ->exit_id Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D288.7030001@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:29 +01:00
Lai Jiangshan	31c16b1334	trace_syscalls: Set event_enter_##sname->data to its metadata Set event_enter_##sname->data to its metadata, it makes codes simpler. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D282.7050709@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:28 +01:00
Lai Jiangshan	bf56a4ea9f	trace_syscalls: Remove unused event_syscall_enter and event_syscall_exit fix event_enter_##sname->event fix event_exit_##sname->event remove unused event_syscall_enter and event_syscall_exit Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D278.4090209@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:28 +01:00
David Howells	f13a48bd79	SLOW_WORK: Move slow_work's proc file to debugfs Move slow_work's debugging proc file to debugfs. Signed-off-by: David Howells <dhowells@redhat.com> Requested-and-acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-01 08:20:31 -08:00
David Howells	fa1dae4906	SLOW_WORK: Fix the CONFIG_MODULES=n case Commits `3d7a641` ("SLOW_WORK: Wait for outstanding work items belonging to a module to clear") introduced some code to make sure that all of a module's slow-work items were complete before that module was removed, and commit `3bde31a` ("SLOW_WORK: Allow a requeueable work item to sleep till the thread is needed") further extended that, breaking it in the process if CONFIG_MODULES=n: CC kernel/slow-work.o kernel/slow-work.c: In function 'slow_work_execute': kernel/slow-work.c:313: error: 'slow_work_thread_processing' undeclared (first use in this function) kernel/slow-work.c:313: error: (Each undeclared identifier is reported only once kernel/slow-work.c:313: error: for each function it appears in.) kernel/slow-work.c: In function 'slow_work_wait_for_items': kernel/slow-work.c:950: error: 'slow_work_unreg_sync_lock' undeclared (first use in this function) kernel/slow-work.c:951: error: 'slow_work_unreg_wq' undeclared (first use in this function) kernel/slow-work.c:961: error: 'slow_work_unreg_work_item' undeclared (first use in this function) kernel/slow-work.c:974: error: 'slow_work_unreg_module' undeclared (first use in this function) kernel/slow-work.c:977: error: 'slow_work_thread_processing' undeclared (first use in this function) make[1]: *** [kernel/slow-work.o] Error 1 Fix this by: (1) Extracting the bits of slow_work_execute() that are contingent on CONFIG_MODULES, and the bits that should be, into inline functions and placing them into the #ifdef'd section that defines the relevant variables and adding stubs for moduleless kernels. This allows the removal of some #ifdefs. (2) #ifdef'ing out the contents of slow_work_wait_for_items() in moduleless kernels. The four functions related to handling module unloading synchronisation (and their associated variables) could be offloaded into a separate .c file, but each function is only used once and three of them are tiny, so doing so would prevent them from being inlined. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-01 07:35:11 -08:00
Xiao Guangrong	59d069eb5a	perf_event: Initialize data.period in perf_swevent_hrtimer() In current code in perf_swevent_hrtimer(), data.period is not initialized, The result is obvious wrong: # ./perf record -f -e cpu-clock make # ./perf report # Samples: 1740 # # Overhead Command ...... # ........ ........ .......................................... # 1025422183050275328.00% sh libc-2.9.90.so ... 1025422183050275328.00% perl libperl.so ... 1025422168240043264.00% perl [kernel] ... 1025422030011210752.00% perl [kernel] ... Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <4B14E220.2050107@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 11:19:07 +01:00
Masami Hiramatsu	ba8665d7dd	trace_kprobes: Fix a memory leak bug and check kstrdup() return value Fix a memory leak case in create_trace_probe(). When an argument is too long (> MAX_ARGSTR_LEN), it just jumps to error path. In that case tp->args[i].name is not released. This also fixes a bug to check kstrdup()'s return value. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091201001919.10235.56455.stgit@harusame> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 08:19:59 +01:00
Simon Kagstrom	456b565cc5	core: Add kernel message dumper to call on oopses and panics The core functionality is implemented as per Linus suggestion from http://lists.infradead.org/pipermail/linux-mtd/2009-October/027620.html (with the kmsg_dump implementation by Linus). A struct kmsg_dumper has been added which contains a callback to dump the kernel log buffers on crashes. The kmsg_dump function gets called from oops_exit() and panic() and invokes this callbacks with the crash reason. [dwmw2: Fix log_end handling] Signed-off-by: Simon Kagstrom <simon.kagstrom@netinsight.net> Reviewed-by: Anders Grafstrom <anders.grafstrom@netinsight.net> Reviewed-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-11-30 12:01:49 +00:00
Avi Kivity	8e7cac7980	core: Fix user return notifier on fork() fork() clones all thread_info flags, including TIF_USER_RETURN_NOTIFY; if the new task is first scheduled on a cpu which doesn't have user return notifiers set, this causes user return notifiers to trigger without any way of clearing itself. This is easy to trigger with a forky workload on the host in parallel with kvm, resulting in a cpu in an endless loop on the verge of returning to userspace. Fix by dropping the TIF_USER_RETURN_NOTIFY immediately after fork. Signed-off-by: Avi Kivity <avi@redhat.com> LKML-Reference: <1259505288-16559-1-git-send-email-avi@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-29 22:03:04 +01:00
Lai Jiangshan	52a11f3549	trace_kprobes: Don't output zero offset "symbol_name+0" is not so friendly. It makes the output longer. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B0CEBCB.7080309@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:43:05 +01:00
Lai Jiangshan	3d9b2e1ddf	trace_kprobes: Always show group name Sometimes the group name is not "kprobes", It'll be better if we can read it from tracing/kprobe_events. # echo 'r:laijs/vfs_read vfs_read %ax' > kprobe_events # cat kprobe_events r:laijs/vfs_read vfs_read %ax=%ax Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B0CEBAF.6000104@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:43:04 +01:00
Lai Jiangshan	abab9d37d2	trace_kprobes: Fix memory leak tp->nr_args is not set before we "goto error", it causes memory leak for free_trace_probe() use tp->nr_args to free memory of args. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B0CEB95.2060107@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:43:04 +01:00
Lai Jiangshan	0f1ef51d24	trace_syscalls: Add syscall nr field Field syscall number is missed in syscall_enter_define_fields()/ syscall_exit_define_fields(). Syscall number is also needed for event filter or other users. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B0E330D.1070206@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:24:19 +01:00
Frederic Weisbecker	dd1853c3f4	hw-breakpoints: Use struct perf_event_attr to define kernel breakpoints Kernel breakpoints are created using functions in which we pass breakpoint parameters as individual variables: address, length and type. Although it fits well for x86, this just does not scale across architectures that may support this api later as these may have more or different needs. Pass in a perf_event_attr structure instead because it is meant to evolve as much as possible into a generic hardware breakpoint parameter structure. Reported-by: K.Prasad <prasad@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1259294154-5197-2-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:22:59 +01:00
Frederic Weisbecker	5fa10b28e5	hw-breakpoints: Use struct perf_event_attr to define user breakpoints In-kernel user breakpoints are created using functions in which we pass breakpoint parameters as individual variables: address, length and type. Although it fits well for x86, this just does not scale across archictectures that may support this api later as these may have more or different needs. Pass in a perf_event_attr structure instead because it is meant to evolve as much as possible into a generic hardware breakpoint parameter structure. Reported-by: K.Prasad <prasad@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1259294154-5197-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:22:58 +01:00
Anton Blanchard	e5af022616	softlockup: Fix hung_task_check_count sysctl I'm seeing spikes of up to 0.5ms in khungtaskd on a large machine. To reduce this source of jitter I tried setting hung_task_check_count to 0: # echo 0 > /proc/sys/kernel/hung_task_check_count which didn't have the intended response. Change to a post increment of max_count, so a value of 0 means check 0 tasks. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: msb@google.com LKML-Reference: <20091127022820.GU32182@kryten> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:21:57 +01:00
Stephane Eranian	b2e74a265d	perf_events: Fix read() bogus counts when in error state When a pinned group cannot be scheduled it goes into error state. Normally a group cannot go out of error state without being explicitly re-enabled or disabled. There was a bug in per-thread mode, whereby upon termination of the thread, the group would transition from error to off leading to bogus counts and timing information returned by read(). Fix it by clearing the error state. Signed-off-by: Stephane Eranian <eranian@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: perfmon2-devel@lists.sourceforge.net LKML-Reference: <4b0eb9ce.0508d00a.573b.ffffeab6@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 18:49:59 +01:00
Hidetoshi Seto	b7b20df91d	sched, time: Define nsecs_to_jiffies() Use of msecs_to_jiffies() for nsecs_to_cputime() have some problems: - The type of msecs_to_jiffies()'s argument is unsigned int, so it cannot convert msecs greater than UINT_MAX = about 49.7 days. - msecs_to_jiffies() returns MAX_JIFFY_OFFSET if MSB of argument is set, assuming that input was negative value. So it cannot convert msecs greater than INT_MAX = about 24.8 days too. This patch defines a new function nsecs_to_jiffies() that can deal greater values, and that can deal all incoming values as unsigned. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Spencer Candland <spencer@bluehost.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Amrico Wang <xiyou.wangcong@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <johnstul@linux.vnet.ibm.com> LKML-Reference: <4B0E16E7.5070307@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 12:59:20 +01:00
Hidetoshi Seto	d5b7c78e97	sched: Remove task_{u,s,g}time() Now all task_{u,s}time() pairs are replaced by task_times(). And task_gtime() is too simple to be an inline function. Cleanup them all. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Spencer Candland <spencer@bluehost.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> LKML-Reference: <4B0E16D1.70902@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 12:59:20 +01:00
Hidetoshi Seto	d180c5bcce	sched: Introduce task_times() to replace task_{u,s}time() pair Functions task_{u,s}time() are called in pair in almost all cases. However task_stime() is implemented to call task_utime() from its inside, so such paired calls run task_utime() twice. It means we do heavy divisions (div_u64 + do_div) twice to get utime and stime which can be obtained at same time by one set of divisions. This patch introduces a function task_times(tsk, utime, *stime) to retrieve utime and stime at once in better, optimized way. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Spencer Candland <spencer@bluehost.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> LKML-Reference: <4B0E16AE.906@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 12:59:19 +01:00
Masami Hiramatsu	ba005e1f41	tracepoint: Add signal loss events Add signal_overflow_fail and signal_lose_info tracepoints for signal-lost events. Changes in v3: - Add docbook style comments Changes in v2: - Use siginfo string macro Suggested-by: Roland McGrath <roland@redhat.com> Reviewed-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Oleg Nesterov <oleg@redhat.com> LKML-Reference: <20091124215658.30449.9934.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:55:38 +01:00
Masami Hiramatsu	f9d4257e01	tracepoint: Add signal deliver event Add a tracepoint where a process gets a signal. This tracepoint shows signal-number, sa-handler and sa-flag. Changes in v3: - Add docbook style comments Changes in v2: - Add siginfo argument - Fix comment Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Reviewed-by: Jason Baron <jbaron@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Oleg Nesterov <oleg@redhat.com> LKML-Reference: <20091124215651.30449.20926.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:55:38 +01:00
Masami Hiramatsu	d1eb650ff4	tracepoint: Move signal sending tracepoint to events/signal.h Move signal sending event to events/signal.h. This patch also renames sched_signal_send event to signal_generate. Changes in v4: - Fix a typo of task_struct pointer. Changes in v3: - Add docbook style comments Changes in v2: - Add siginfo argument - Add siginfo storing macro Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Reviewed-by: Jason Baron <jbaron@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Oleg Nesterov <oleg@redhat.com> LKML-Reference: <20091124215645.30449.60208.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:55:37 +01:00
Ingo Molnar	16bc67edeb	Merge branch 'sched/urgent' into sched/core Merge reason: Pick up fixes that did not make it into .32.0 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:50:42 +01:00
Frederic Weisbecker	80bbf6b641	hw-breakpoints: Fix unused function in off-case bp_perf_event_destroy() is unused in its off-case version, let's remove it to fix the following warning reported by Stephen Rothwell in linux-next: kernel/perf_event.c:4306: warning: 'bp_perf_event_destroy' defined but not used Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1259180453-5813-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:40:51 +01:00
Mike Travis	feae3203d7	timers, init: Limit the number of per cpu calibration bootup messages Limit the number of per cpu calibration messages by only printing out results for the first cpu to boot. Also, don't print "CPUx is down" as this is expected, and we don't need 4096 reminders... ;-) Signed-off-by: Mike Travis <travis@sgi.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Tejun Heo <tj@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091118002219.889552000@alcatraz.americas.sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:18:42 +01:00
Mike Travis	f6630114d9	sched: Limit the number of scheduler debug messages Remove the verbose scheduler debug messages unless kernel parameter "sched_debug" set. /proc/sched_debug unchanged. Signed-off-by: Mike Travis <travis@sgi.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Tejun Heo <tj@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091118002221.489305000@alcatraz.americas.sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:17:30 +01:00
Andrew Morton	11e6635763	kernel/hw_breakpoint.c: Fix local/global shadowing If the new percpu tree is combined with the perf events tree the following new warning triggers: kernel/hw_breakpoint.c: In function 'toggle_bp_task_slot': kernel/hw_breakpoint.c:151: warning: 'task_bp_pinned' is used uninitialized in this function Because it's not valid anymore to define a local variable and a percpu variable (even if it's file scope local) with the same name. Rename the local variable to resolve this. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <200911260701.nAQ71owx016356@imap1.linux-foundation.org> [ v2: added changelog ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 09:34:04 +01:00
Frederic Weisbecker	605bfaee90	hw-breakpoints: Simplify error handling in breakpoint creation requests This simplifies the error handling when we create a breakpoint. We don't need to check the NULL return value corner case anymore since we have improved perf_event_create_kernel_counter() to always return an error code in the failure case. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1259210142-5714-3-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 09:29:21 +01:00
Frederic Weisbecker	c6567f642e	hw-breakpoints: Improve in-kernel event creation error granularity In fail case, perf_event_create_kernel_counter() returns NULL instead of an error, which doesn't help us to inform the user about the origin of the problem from the outer most callers. Often we can just return -EINVAL, which doesn't help anyone when it's eventually about a memory allocation failure. Then, this patch makes perf_event_create_kernel_counter() always return a detailed error code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1259210142-5714-2-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 09:29:21 +01:00
Frederic Weisbecker	d99be40aff	ksym_tracer: Fix breakpoint removal after modification The error path of a breakpoint modification is broken in the ksym tracer. A modified breakpoint hlist node is immediately released after its removal. Also we leak a breakpoint in this case. Fix the path. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1259210142-5714-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 09:29:20 +01:00
Steven Rostedt	7ac0743404	ring-buffer-benchmark: Add parameters to set produce/consumer priorities Running the ring-buffer-benchmark's threads at the lowest priority may work well for keeping it in the background, but it is not appropriate for the benchmarks. This patch adds 4 parameters to the module: consumer_fifo consumer_nice producer_fifo producer_nice By default the consumer and producer still run at nice +19. If the _fifo options are set, they will override the _nice values. modprobe ring_buffer_benchmark consumer_nice=0 producer_fifo=10 The above will set the consumer thread to a nice value of 0, and the producer thread to a RT SCHED_FIFO priority of 10. Note, this patch also fixes a bug where calling set_user_nice on the consumer thread would oops the kernel when the parameter "disable_reader" is set. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-25 14:14:15 -05:00
Shmulik Ladkani	93335a2155	sched.c: Call debug_show_all_locks() when dumping all tasks In commit v2.6.21-691-g39bc89f ("make SysRq-T show all tasks again") the interface of show_state_filter() was changed: zero valued 'state_filter' specifies "dump all tasks" (instead of -1). However, the condition for calling debug_show_all_locks() ("show locks if all tasks are dumped") was not updated accordingly. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: peterz@infradead.org LKML-Reference: <4b0d2fe4.0ab6660a.6437.3cfc@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-25 14:26:52 +01:00
Tom Zanussi	99df5a6a21	trace/syscalls: Change ret param in struct syscall_trace_exit to long Commit `ee949a86b3` ("tracing/syscalls: Use long for syscall ret format and field definitions") changed the syscall exit return type to long, but forgot to change it in the struct. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1259133299-23594-3-git-send-email-tzanussi@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-25 09:06:10 +01:00
Frederic Weisbecker	fe61267227	perf_events: Fix bad software/trace event recursion counting Commit `4ed7c92d68` (perf_events: Undo some recursion damage) has introduced a bad reference counting of the recursion context. putting the context behaves like getting it, dropping every software/trace events after the first one in a context. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1259091502-5171-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-24 21:34:00 +01:00
Tim Blechmann	710390d90f	sched: Optimize branch hint in context_switch() Branch hint profiling on my nehalem machine showed over 90% incorrect branch hints: 10420275 170645395 94 context_switch sched.c 3043 10408421 171098521 94 context_switch sched.c 3050 Signed-off-by: Tim Blechmann <tim@klingt.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B0BBB9F.6080304@klingt.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-24 12:18:42 +01:00
Tim Blechmann	36ace27e3e	sched: Optimize branch hint in pick_next_task_fair() Branch hint profiling on my nehalem machine showed 90% incorrect branch hints: 15728471 158903754 90 pick_next_task_fair sched_fair.c 1555 Signed-off-by: Tim Blechmann <tim@klingt.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B0BBBB1.2050100@klingt.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-24 12:18:12 +01:00
Stephane Eranian	184d3da8ef	perf_events: Fix bogus copy_to_user() in perf_event_read_group() When using an event group, the value and id for non leaders events were wrong due to invalid offset into the outgoing buffer. Signed-off-by: Stephane Eranian <eranian@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: paulus@samba.org Cc: perfmon2-devel@lists.sourceforge.net LKML-Reference: <4b0b71e1.0508d00a.075e.ffff84a3@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-24 08:55:27 +01:00
Serge E. Hallyn	b3a222e52e	remove CONFIG_SECURITY_FILE_CAPABILITIES compile option As far as I know, all distros currently ship kernels with default CONFIG_SECURITY_FILE_CAPABILITIES=y. Since having the option on leaves a 'no_file_caps' option to boot without file capabilities, the main reason to keep the option is that turning it off saves you (on my s390x partition) 5k. In particular, vmlinux sizes came to: without patch fscaps=n: 53598392 without patch fscaps=y: 53603406 with this patch applied: 53603342 with the security-next tree. Against this we must weigh the fact that there is no simple way for userspace to figure out whether file capabilities are supported, while things like per-process securebits, capability bounding sets, and adding bits to pI if CAP_SETPCAP is in pE are not supported with SECURITY_FILE_CAPABILITIES=n, leaving a bit of a problem for applications wanting to know whether they can use them and/or why something failed. It also adds another subtly different set of semantics which we must maintain at the risk of severe security regressions. So this patch removes the SECURITY_FILE_CAPABILITIES compile option. It drops the kernel size by about 50k over the stock SECURITY_FILE_CAPABILITIES=y kernel, by removing the cap_limit_ptraced_target() function. Changelog: Nov 20: remove cap_limit_ptraced_target() as it's logic was ifndef'ed. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: Andrew G. Morgan" <morgan@kernel.org> Signed-off-by: James Morris <jmorris@namei.org>	2009-11-24 15:06:47 +11:00
Andrew G. Morgan	c4a5af54c8	Silence the existing API for capability version compatibility check. When libcap, or other libraries attempt to confirm/determine the supported capability version magic, they generally supply a NULL dataptr to capget(). In this case, while returning the supported/preferred magic (via a modified header content), the return code of this system call may be 0, -EINVAL, or -EFAULT. No libcap code depends on the previous -EINVAL etc. return code, and all of the above three return codes can accompany a valid (successful) attempt to determine the requested magic value. This patch cleans up the system call to return 0, if the call is successfully being used to determine the supported/preferred capability magic value. Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Steve Grubb <sgrubb@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-11-24 08:53:29 +11:00
Jan Blunck	429947248f	sched_feat_write(): Update ppos instead of file->f_pos sched_feat_write() should update ppos instead of file->f_pos. (This reduces some BKL dependencies of this code.) Signed-off-by: Jan Blunck <jblunck@suse.de> Cc: jkacur@redhat.com Cc: Arnd Bergmann <arnd@arndb.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jamie Lokier <jamie@shareable.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> LKML-Reference: <1258735245-25826-8-git-send-email-jblunck@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 19:38:03 +01:00
Frederic Weisbecker	f5ffe02e50	perf: Add kernel side syscall events support for breakpoints Add the remaining necessary bits to support breakpoints created through perf syscall. We don't use the software counter interface as: - We don't need to check against recursion, this is already done in hardware breakpoints arch level. - We already know the perf event we are dealing with when the event is to be committed. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1258987355-8751-3-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 18:18:31 +01:00
Frederic Weisbecker	fdf6bc9522	hw-breakpoints: Check the breakpoint params from perf tools Perf tools create perf events as disabled in the beginning. Breakpoints are then considered like ptrace temporary breakpoints, only meant to reserve a breakpoint slot until we get all the necessary informations from the user. In this case, we don't check the address that is breakpointed as it is NULL in the ptrace case. But perf tools don't have the same purpose, events are created disabled to wait for all events to be created before enabling all of them. We want to check the breakpoint parameters in this case. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1258987355-8751-2-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 18:18:30 +01:00
K.Prasad	ba6909b719	hw-breakpoint: Attribute authorship of hw-breakpoint related files Attribute authorship to developers of hw-breakpoint related files. Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091123154713.GA5593@in.ibm.com> [ v2: moved it to latest -tip ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 18:18:29 +01:00
Peter Zijlstra	acd1d7c1f8	perf_events: Restore sanity to scaling land It is quite possible to call update_event_times() on a context that isn't actually running and thereby confuse the thing. perf stat was reporting !100% scale values for software counters (`2e2af50b` perf_events: Disable events when we detach them, solved the worst of that, but there was still some left). The thing that happens is that because we are not self-reaping (we have a caring parent) there is a time between the last schedule (out) and having do_exit() called which will detach the events. This period would be accounted as enabled,!running because the event->state==INACTIVE, even though !event->ctx->is_active. Similar issues could have been observed by calling read() on a event while the attached task was not scheduled in. Solve this by teaching update_event_times() about ctx->is_active. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1258984836.4531.480.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 15:22:19 +01:00
Peter Zijlstra	4ed7c92d68	perf_events: Undo some recursion damage Make perf_swevent_get_recursion_context return a context number and disable preemption. This could be used to remove the IRQ disable from the trace bit and index the per-cpu buffer with. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091123103819.993226816@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:49:57 +01:00
Peter Zijlstra	f67218c3e9	perf_events: Fix __perf_event_exit_task() vs. update_event_times() locking Move the update_event_times() call in __perf_event_exit_task() into list_del_event() because that holds the proper lock (ctx->lock) and seems a more natural place to do the last time update. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091123103819.842455480@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:49:57 +01:00
Peter Zijlstra	5e942bb333	perf_events: Update the context time on exit It appeared we did call update_event_times() on exit, but we failed to update the context time, which renders the former moot. Locking is a bit iffy, we call update_event_times under ctx->mutex instead of ctx->lock - the next patch fixes this. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091123103819.764207355@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:49:56 +01:00
Peter Zijlstra	2e2af50b1f	perf_events: Disable events when we detach them If we leave the event in STATE_INACTIVE, any read of the event after the detach will increase the running count but not the enabled count and cause funny scaling artefacts. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091123103819.689055515@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:49:56 +01:00
Peter Zijlstra	6c2bfcbe58	perf_events: Fix style nits Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091123103819.613427378@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:49:55 +01:00
Peter Zijlstra	a66a3052e2	perf_events: Undo copy/paste damage We had two almost identical functions, avoid the duplication. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <20091123103819.537537928@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:49:55 +01:00
Ingo Molnar	a4234bfcf4	perf_events: Optimize the swcounter hotpath The structure init creates a bit memcpy, which shows up big time in perf annotate output: : ffffffff810a859d <__perf_sw_event>: 1.68 : ffffffff810a859d: 55 push %rbp 1.69 : ffffffff810a859e: 41 89 fa mov %edi,%r10d 0.01 : ffffffff810a85a1: 49 89 c9 mov %rcx,%r9 0.00 : ffffffff810a85a4: 31 c0 xor %eax,%eax 1.71 : ffffffff810a85a6: b9 16 00 00 00 mov $0x16,%ecx 0.00 : ffffffff810a85ab: 48 89 e5 mov %rsp,%rbp 0.00 : ffffffff810a85ae: 48 83 ec 60 sub $0x60,%rsp 1.52 : ffffffff810a85b2: 48 8d 7d a0 lea -0x60(%rbp),%rdi 85.20 : ffffffff810a85b6: f3 ab rep stos %eax,%es:(%rdi) None of the callees depends on the structure being pre-initialized, so only initialize ->addr. This gets rid of the memcpy overhead. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:48:27 +01:00
Ingo Molnar	457dc928f5	tracing, function tracer: Clean up strstrip() usage Clean up strstrip() usage - which also addresses this build warning: kernel/trace/ftrace.c: In function 'ftrace_pid_write': kernel/trace/ftrace.c:3004: warning: ignoring return value of 'strstrip', declared with attribute warn_unused_result Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:04:07 +01:00
Ingo Molnar	6e3d8330ae	perf events: Do not generate function trace entries in perf code Decreases perf overhead when function tracing is enabled, by about 50%. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 10:19:20 +01:00
Ingo Molnar	98e4833ba3	ring-buffer benchmark: Run producer/consumer threads at nice +19 The ring-buffer benchmark threads run on nice 0 by default, using up a lot of CPU time and slowing down the system: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1024 root 20 0 0 0 0 D 95.3 0.0 4:01.67 rb_producer 1023 root 20 0 0 0 0 R 93.5 0.0 2:54.33 rb_consumer 21569 mingo 40 0 14852 1048 772 R 3.6 0.1 0:00.05 top 1 root 40 0 4080 928 668 S 0.0 0.0 0:23.98 init Renice them to +19 to make them less intrusive. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 08:03:09 +01:00
Paul E. McKenney	6ebb237bec	rcu: Re-arrange code to reduce #ifdef pain Remove #ifdefs from kernel/rcupdate.c and include/linux/rcupdate.h by moving code to include/linux/rcutiny.h, include/linux/rcutree.h, and kernel/rcutree.c. Also remove some definitions that are no longer used. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1258908830885-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 18:58:16 +01:00
Paul E. McKenney	9f680ab414	rcu: Eliminate unneeded function wrapping The functions rcu_init() is a wrapper for __rcu_init(), and also sets up the CPU-hotplug notifier for rcu_barrier_cpu_hotplug(). But TINY_RCU doesn't need CPU-hotplug notification, and the rcu_barrier_cpu_hotplug() is a simple wrapper for rcu_cpu_notify(). So push rcu_init() out to kernel/rcutree.c and kernel/rcutiny.c and get rid of the wrapper function rcu_barrier_cpu_hotplug(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12589088302320-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 18:58:16 +01:00
Paul E. McKenney	b668c9cf3e	rcu: Fix grace-period-stall bug on large systems with CPU hotplug When the last CPU of a given leaf rcu_node structure goes offline, all of the tasks queued on that leaf rcu_node structure (due to having blocked in their current RCU read-side critical sections) are requeued onto the root rcu_node structure. This requeuing is carried out by rcu_preempt_offline_tasks(). However, it is possible that these queued tasks are the only thing preventing the leaf rcu_node structure from reporting a quiescent state up the rcu_node hierarchy. Unfortunately, the old code would fail to do this reporting, resulting in a grace-period stall given the following sequence of events: 1. Kernel built for more than 32 CPUs on 32-bit systems or for more than 64 CPUs on 64-bit systems, so that there is more than one rcu_node structure. (Or CONFIG_RCU_FANOUT is artificially set to a number smaller than CONFIG_NR_CPUS.) 2. The kernel is built with CONFIG_TREE_PREEMPT_RCU. 3. A task running on a CPU associated with a given leaf rcu_node structure blocks while in an RCU read-side critical section -and- that CPU has not yet passed through a quiescent state for the current RCU grace period. This will cause the task to be queued on the leaf rcu_node's blocked_tasks[] array, in particular, on the element of this array corresponding to the current grace period. 4. Each of the remaining CPUs corresponding to this same leaf rcu_node structure pass through a quiescent state. However, the task is still in its RCU read-side critical section, so these quiescent states cannot be reported further up the rcu_node hierarchy. Nevertheless, all bits in the leaf rcu_node structure's ->qsmask field are now zero. 5. Each of the remaining CPUs go offline. (The events in step #4 and #5 can happen in any order as long as each CPU passes through a quiescent state before going offline.) 6. When the last CPU goes offline, __rcu_offline_cpu() will invoke rcu_preempt_offline_tasks(), which will move the task to the root rcu_node structure, but without reporting a quiescent state up the rcu_node hierarchy (and this failure to report a quiescent state is the bug). But because this leaf rcu_node structure's ->qsmask field is already zero and its ->block_tasks[] entries are all empty, force_quiescent_state() will skip this rcu_node structure. Therefore, grace periods are now hung. This patch abstracts some code out of rcu_read_unlock_special(), calling the result task_quiet() by analogy with cpu_quiet(), and invokes task_quiet() from both rcu_read_lock_special() and __rcu_offline_cpu(). Invoking task_quiet() from __rcu_offline_cpu() reports the quiescent state up the rcu_node hierarchy, fixing the bug. This ends up requiring a separate lock_class_key per level of the rcu_node hierarchy, which this patch also provides. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12589088301770-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 18:58:15 +01:00
Ingo Molnar	645e8cc0c9	perf_events: Fix modular build Fix: ERROR: "perf_swevent_put_recursion_context" [fs/ext4/ext4.ko] undefined! ERROR: "perf_swevent_get_recursion_context" [fs/ext4/ext4.ko] undefined! Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <1258864015-10579-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 12:21:33 +01:00
Márton Németh	96b02d78a7	perf_event: Remove redundant zero fill The buffer is first zeroed out by memset(). Then strncpy() is used to fill the content. The strncpy() function also pads the string till the end of the specified length, which is redundant. The strncpy() does not ensures that the string will be properly closed with 0. Use strlcpy() instead. The semantic match that finds this kind of pattern is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression buffer; expression size; expression str; @@ memset(buffer, 0, size); ... - strncpy( + strlcpy( buffer, str, sizeof(buffer) ); @@ expression buffer; expression size; expression str; @@ memset(&buffer, 0, size); ... - strncpy( + strlcpy( &buffer, str, sizeof(buffer)); @@ expression buffer; identifier field; expression size; expression str; @@ memset(buffer, 0, size); ... - strncpy( + strlcpy( buffer->field, str, sizeof(buffer->field) ); @@ expression buffer; identifier field; expression size; expression str; @@ memset(&buffer, 0, size); ... - strncpy( + strlcpy( buffer.field, str, sizeof(buffer.field)); // </smpl> On strncpy() vs strlcpy() see http://www.gratisoft.us/todd/papers/strlcpy.html . Signed-off-by: Márton Németh <nm127@freemail.hu> Cc: Julia Lawall <julia@diku.dk> Cc: cocci@diku.dk Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B086547.5040100@freemail.hu> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 09:49:26 +01:00
Frederic Weisbecker	b3a75542d3	hw-breakpoints: Remove x86 specific headers from core file Remove asm/processor.h and asm/debugreg.h as these headers are not used anymore in the hw-breakpoints core file. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1258863695-10464-3-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 09:03:43 +01:00
Frederic Weisbecker	28889bf9e2	tracing: Forget about the NMI buffer for syscall events We are never in an NMI context when we commit a syscall trace to perf. So just forget about the nmi buffer there. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <1258863695-10464-2-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 09:03:42 +01:00
Frederic Weisbecker	ce71b9df88	tracing: Use the perf recursion protection from trace event When we commit a trace to perf, we first check if we are recursing in the same buffer so that we don't mess-up the buffer with a recursing trace. But later on, we do the same check from perf to avoid commit recursion. The recursion check is desired early before we touch the buffer but we want to do this check only once. Then export the recursion protection from perf and use it from the trace events before submitting a trace. v2: Put appropriate Reported-by tag Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <1258864015-10579-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 09:03:42 +01:00
Stephane Eranian	8904b18046	perf_events: Fix default watermark calculation This patch fixes the default watermark value for the sampling buffer. With the existing calculation (watermark = max(PAGE_SIZE, max_size / 2)), no notification was ever received when the buffer was exactly 1 page. This was because you would never cross the threshold (there is no partial samples). In certain configuration, there was no possibilty detecting the problem because there was not enough space left to store the LOST record.In fact, there may be a more generic problem here. The kernel should ensure that there is alaways enough space to store one LOST record. This patch sets the default watermark to half the buffer size. With such limit, we are guaranteed to get a notification even with a single page buffer assuming no sample is bigger than a page. Signed-off-by: Stephane Eranian <eranian@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212509.344964101@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <1256302576-6169-1-git-send-email-eranian@gmail.com>	2009-11-21 14:11:41 +01:00
Peter Zijlstra	6f10581aea	perf: Fix locking for PERF_FORMAT_GROUP We should hold event->child_mutex when iterating the inherited counters, we should hold ctx->mutex when iterating siblings. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212509.251030114@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:40 +01:00
Peter Zijlstra	59ed446f79	perf: Fix event scaling for inherited counters Properly account the full hierarchy of counters for both the count (we already did so) and the scale times (new). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212509.153379276@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:40 +01:00
Peter Zijlstra	2b8988c9f7	perf: Fix time locking Most sites updating ctx->time and event times do so under ctx->lock, make sure they all do. This was made possible by removing the __perf_event_read() call from __perf_event_sync_stat(), which already had this lock taken. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212509.102316434@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:39 +01:00
Peter Zijlstra	58e5ad1de3	perf: Simplify __perf_event_read cpuctx is always active, task context is always active for current the previous condition verifies that if its a task context its for current, hence we can assume ctx->is_active. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212509.000272254@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:39 +01:00
Peter Zijlstra	3dbebf15c5	perf: Simplify __perf_event_sync_stat Removes constraints from __perf_event_read() by leaving it with a single callsite; this callsite had ctx->lock held, the other one does not. Removes some superfluous code from __perf_event_sync_stat(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.918544317@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:39 +01:00
Peter Zijlstra	f6f8378522	perf: Optimize __perf_event_read() Both callers actually have IRQs disabled, no need doing so again. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.863685796@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:38 +01:00
Peter Zijlstra	02ffdbc866	perf: Optimize perf_event_task_sched_out Remove an update_context_time() call from the perf_event_task_sched_out() path and into the branch its needed. The call was both superfluous, because __perf_event_sched_out() already does it, and wrong, because it was done without holding ctx->lock. Place it in perf_event_sync_stat(), which is the only place it is needed and which does already hold ctx->lock. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.779516394@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:38 +01:00
Peter Zijlstra	abf4868b85	perf: Fix PERF_FORMAT_GROUP scale info As Corey reported, the total_enabled and total_running times could occasionally be 0, even though there were events counted. It turns out this is because we record the times before reading the counter while the latter updates the times. This patch corrects that. While looking at this code I found that there is a lot of locking iffyness around, the following patches correct most of that. Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.685559857@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:37 +01:00
Peter Zijlstra	f6d9dd237d	perf: Optimize perf_event_mmap_ctx() Remove a rcu_read_{,un}lock() pair and a few conditionals. We can remove the rcu_read_lock() by increasing the scope of one in the calling function. We can do away with the system_state check if the machine still boots after this patch (seems to be the case). We can do away with the list_empty() check because the bare list_for_each_entry_rcu() reduces to that now that we've removed everything else. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.606459548@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:37 +01:00
Peter Zijlstra	f6595f3a96	perf: Optimize perf_event_comm_ctx() Remove a rcu_read_{,un}lock() pair and a few conditionals. We can remove the rcu_read_lock() by increasing the scope of one in the calling function. We can do away with the system_state check if the machine still boots after this patch (seems to be the case). We can do away with the list_empty() check because the bare list_for_each_entry_rcu() reduces to that now that we've removed everything else. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.527608793@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:36 +01:00
Peter Zijlstra	d6ff86cfb5	perf: Optimize perf_event_task_ctx() Remove a rcu_read_{,un}lock() pair and a few conditionals. We can remove the rcu_read_lock() by increasing the scope of one in the calling function. We can do away with the system_state check if the machine still boots after this patch (seems to be the case). We can do away with the list_empty() check because the bare list_for_each_entry_rcu() reduces to that now that we've removed everything else. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.452227115@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:36 +01:00
Peter Zijlstra	8152018387	perf: Optimize perf_swevent_ctx_event() Remove a rcu_read_{,un}lock() pair and a few conditionals. We can remove the rcu_read_lock() by increasing the scope of one in the calling function. We can do away with the system_state check if the machine still boots after this patch (seems to be the case). We can do away with the list_empty() check because the bare list_for_each_entry_rcu() reduces to that now that we've removed everything else. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.378188589@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:35 +01:00
Peter Zijlstra	0cff784ae4	perf: Optimize some swcounter attr.sample_period==1 paths Avoid the rather expensive perf_swevent_set_period() if we know we have to sample every single event anyway. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.299508332@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:35 +01:00
Peter Zijlstra	453f19eea7	perf: Allow for custom overflow handlers in-kernel perf users might wish to have custom actions on the sample interrupt. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.222339539@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:35 +01:00
Ingo Molnar	96200591a3	Merge branch 'tracing/hw-breakpoints' into perf/core Conflicts: arch/x86/kernel/kprobes.c kernel/trace/Makefile Merge reason: hw-breakpoints perf integration is looking good in testing and in reviews, plus conflicts are mounting up - so merge & resolve. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:07:23 +01:00
Thomas Gleixner	34769945f7	genirq: Fix spurious irq seqfile conversion single_open data argument must be PDE(inode)->data instead of NULL otherwise seq_file->private is always NULL and we always read the spurious data of irq 0. Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-20 11:55:26 +01:00
David Howells	3bde31a4ac	SLOW_WORK: Allow a requeueable work item to sleep till the thread is needed Add a function to allow a requeueable work item to sleep till the thread processing it is needed by the slow-work facility to perform other work. Sometimes a work item can't progress immediately, but must wait for the completion of another work item that's currently being processed by another slow-work thread. In some circumstances, the waiting item could instead - theoretically - put itself back on the queue and yield its thread back to the slow-work facility, thus waiting till it gets processing time again before attempting to progress. This would allow other work items processing time on that thread. However, this only works if there is something on the queue for it to queue behind - otherwise it will just get a thread again immediately, and will end up cycling between the queue and the thread, eating up valuable CPU time. So, slow_work_sleep_till_thread_needed() is provided such that an item can put itself on a wait queue that will wake it up when the event it is actually interested in occurs, then call this function in lieu of calling schedule(). This function will then sleep until either the item's event occurs or another work item appears on the queue. If another work item is queued, but the item's event hasn't occurred, then the work item should requeue itself and yield the thread back to the slow-work facility by returning. This can be used by CacheFiles for an object that is being created on one thread to wait for an object being deleted on another thread where there is nothing on the queue for the creation to go and wait behind. As soon as an item appears on the queue that could be given thread time instead, CacheFiles can stick the creating object back on the queue and return to the slow-work facility - assuming the object deletion didn't also complete. Signed-off-by: David Howells <dhowells@redhat.com>	2009-11-19 18:10:57 +00:00
David Howells	8fba10a42d	SLOW_WORK: Allow the work items to be viewed through a /proc file Allow the executing and queued work items to be viewed through a /proc file for debugging purposes. The contents look something like the following: THR PID ITEM ADDR FL MARK DESC === ===== ================ == ===== ========== 0 3005 ffff880023f52348 a 952ms FSC: OBJ17d3: LOOK 1 3006 ffff880024e33668 2 160ms FSC: OBJ17e5 OP60d3b: Write1/Store fl=2 2 3165 ffff8800296dd180 a 424ms FSC: OBJ17e4: LOOK 3 4089 ffff8800262c8d78 a 212ms FSC: OBJ17ea: CRTN 4 4090 ffff88002792bed8 2 388ms FSC: OBJ17e8 OP60d36: Write1/Store fl=2 5 4092 ffff88002a0ef308 2 388ms FSC: OBJ17e7 OP60d2e: Write1/Store fl=2 6 4094 ffff88002abaf4b8 2 132ms FSC: OBJ17e2 OP60d4e: Write1/Store fl=2 7 4095 ffff88002bb188e0 a 388ms FSC: OBJ17e9: CRTN vsq - ffff880023d99668 1 308ms FSC: OBJ17e0 OP60f91: Write1/EnQ fl=2 vsq - ffff8800295d1740 1 212ms FSC: OBJ16be OP4d4b6: Write1/EnQ fl=2 vsq - ffff880025ba3308 1 160ms FSC: OBJ179a OP58dec: Write1/EnQ fl=2 vsq - ffff880024ec83e0 1 160ms FSC: OBJ17ae OP599f2: Write1/EnQ fl=2 vsq - ffff880026618e00 1 160ms FSC: OBJ17e6 OP60d33: Write1/EnQ fl=2 vsq - ffff880025a2a4b8 1 132ms FSC: OBJ16a2 OP4d583: Write1/EnQ fl=2 vsq - ffff880023cbe6d8 9 212ms FSC: OBJ17eb: LOOK vsq - ffff880024d37590 9 212ms FSC: OBJ17ec: LOOK vsq - ffff880027746cb0 9 212ms FSC: OBJ17ed: LOOK vsq - ffff880024d37ae8 9 212ms FSC: OBJ17ee: LOOK vsq - ffff880024d37cb0 9 212ms FSC: OBJ17ef: LOOK vsq - ffff880025036550 9 212ms FSC: OBJ17f0: LOOK vsq - ffff8800250368e0 9 212ms FSC: OBJ17f1: LOOK vsq - ffff880025036aa8 9 212ms FSC: OBJ17f2: LOOK In the 'THR' column, executing items show the thread they're occupying and queued threads indicate which queue they're on. 'PID' shows the process ID of a slow-work thread that's executing something. 'FL' shows the work item flags. 'MARK' indicates how long since an item was queued or began executing. Lastly, the 'DESC' column permits the owner of an item to give some information. Signed-off-by: David Howells <dhowells@redhat.com>	2009-11-19 18:10:51 +00:00
Jens Axboe	6b8268b17a	SLOW_WORK: Add delayed_slow_work support This adds support for starting slow work with a delay, similar to the functionality we have for workqueues. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com>	2009-11-19 18:10:47 +00:00
Jens Axboe	0160950297	SLOW_WORK: Add support for cancellation of slow work Add support for cancellation of queued slow work and delayed slow work items. The cancellation functions will wait for items that are pending or undergoing execution to be discarded by the slow work facility. Attempting to enqueue work that is in the process of being cancelled will result in ECANCELED. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com>	2009-11-19 18:10:43 +00:00
Jens Axboe	4d8bb2cbcc	SLOW_WORK: Make slow_work_ops ->get_ref/->put_ref optional Make the ability for the slow-work facility to take references on a work item optional as not everyone requires this. Even the internal slow-work stubs them out, so those can be got rid of too. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com>	2009-11-19 18:10:39 +00:00
David Howells	3d7a641e54	SLOW_WORK: Wait for outstanding work items belonging to a module to clear Wait for outstanding slow work items belonging to a module to clear when unregistering that module as a user of the facility. This prevents the put_ref code of a work item from being taken away before it returns. Signed-off-by: David Howells <dhowells@redhat.com>	2009-11-19 18:10:23 +00:00
David S. Miller	3505d1a9fd	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/sfc/sfe4001.c drivers/net/wireless/libertas/cmd.c drivers/staging/Kconfig drivers/staging/Makefile drivers/staging/rtl8187se/Kconfig drivers/staging/rtl8192e/Kconfig	2009-11-18 22:19:03 -08:00
Eric W. Biederman	6d4561110a	sysctl: Drop & in front of every proc_handler. For consistency drop & in front of every proc_handler. Explicity taking the address is unnecessary and it prevents optimizations like stubbing the proc_handlers to NULL. Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joe Perches <joe@perches.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-18 08:37:40 -08:00
Stanislaw Gruszka	8747d793fc	itimers: Fix racy writes to cpu_itimer fields incr_error and error fields of struct cpu_itimer are used when calculating next timer tick in check_cpu_itimers() and should not be modified without tsk->sighand->siglock taken. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> LKML-Reference: <1253802903-979-1-git-send-email-sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-18 16:32:12 +01:00
Rusty Russell	2ea6dec4a2	generic-ipi: Add smp_call_function_any() Andrew points out that acpi-cpufreq uses cpumask_any, when it really would prefer to use the same CPU if possible (to avoid an IPI). In general, this seems a good idea to offer. [ tglx: Documented selection preference and Inlined the UP case to avoid the copy of smp_call_function_single() and the extra EXPORT ] Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Len Brown <len.brown@intel.com> Cc: Zhao Yakui <yakui.zhao@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Cc: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-18 14:52:25 +01:00
Alexey Dobriyan	a1afb6371b	genirq: switch /proc/irq/*/spurious to seq_file [ tglx: compacted it a bit ] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> LKML-Reference: <20090828181743.GA14050@x200.localdomain> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-18 12:50:51 +01:00
Stanislaw Gruszka	ba5ea951d0	posix-cpu-timers: optimize and document timer_create callback We have already new_timer initialized to all-zeros hence in function initializations are not needed. Document function expectation about new_timer argument as well. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: johnstul@us.ibm.com Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-18 12:36:05 +01:00
H Hartley Sweeten	8e1a928a2e	clockevents: Add missing include to pacify sparse Include "tick-internal.h" in order to pick up the extern function prototype for clockevents_shutdown(). This quiets the following sparse build noise: warning: symbol 'clockevents_shutdown' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> LKML-Reference: <BD79186B4FD85F4B8E60E381CAEE190901E24550@mi8nycmail19.Mi8.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Cc: johnstul@us.ibm.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-18 12:31:48 +01:00
Tejun Heo	9398180097	workqueue: fix race condition in schedule_on_each_cpu() Commit `65a6446434` ("HWPOISON: Allow schedule_on_each_cpu() from keventd") which allows schedule_on_each_cpu() to be called from keventd added a race condition. schedule_on_each_cpu() may race with cpu hotplug and end up executing the function twice on a cpu. Fix it by moving direct execution into the section protected with get/put_online_cpus(). While at it, update code such that direct execution is done after works have been scheduled for all other cpus and drop unnecessary cpu != orig test from flush loop. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-11-17 17:40:33 -08:00
Lai Jiangshan	f6060f4681	tracing: Prevent build warning: 'ftrace_graph_buf' defined but not used Prevent build warning when CONFIG_FUNCTION_GRAPH_TRACER is not set. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4AF24381.5060307@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-17 11:05:49 -05:00
Carsten Emde	c13d2f7c32	tracing: Fix trace_marker output When a string was written to <debugfs>/tracing/trace_marker, some strange characters appeared in the trace output instead of the string, since a vprint function erroneously called a vararg print function with a va_list argument. This patch fixes the problem and simplifies the related code. Signed-off-by: Carsten Emde <C.Emde@osadl.org> LKML-Reference: <4B01AE5D.1010801@osadl.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-17 09:19:06 -05:00
Steven Rostedt	5a50e33cc9	ring-buffer: Move access to commit_page up into function used With the change of the way we process commits. Where a commit only happens at the outer most level, and that we don't need to worry about a commit ending after the rb_start_commit() has been called, the code use to grab the commit page before the tail page to prevent a possible race. But this race no longer exists with the rb_start_commit() rb_end_commit() interface. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-17 08:43:01 -05:00
Lin Ming	0696b711e4	timekeeping: Fix clock_gettime vsyscall time warp Since commit `0a544198` "timekeeping: Move NTP adjusted clock multiplier to struct timekeeper" the clock multiplier of vsyscall is updated with the unmodified clock multiplier of the clock source and not with the NTP adjusted multiplier of the timekeeper. This causes user space observerable time warps: new CLOCK-warp maximum: 120 nsecs, 00000025c337c537 -> 00000025c337c4bf Add a new argument "mult" to update_vsyscall() and hand in the timekeeping internal NTP adjusted multiplier. Signed-off-by: Lin Ming <ming.m.lin@intel.com> Cc: "Zhang Yanmin" <yanmin_zhang@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Tony Luck <tony.luck@intel.com> LKML-Reference: <1258436990.17765.83.camel@minggr.sh.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-17 11:52:34 +01:00
Ingo Molnar	a7b63425a4	Merge branch 'perf/core' into perf/probes Resolved merge conflict in tools/perf/Makefile Merge reason: we want to queue up a dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-17 10:17:47 +01:00
Eric W. Biederman	bb9074ff58	Merge commit 'v2.6.32-rc7' Resolve the conflict between v2.6.32-rc7 where dn_def_dev_handler gets a small bug fix and the sysctl tree where I am removing all sysctl strategy routines.	2009-11-17 01:01:34 -08:00
Peter Zijlstra	559fdc3c1b	perf_event: Optimize perf_output_lock() The purpose of perf_output_{un,}lock() is to: 1) avoid publishing incomplete data [ possible when publishing a head that is ahead of an entry that is still being written ] 2) guarantee fwd progress [ a simple refcount on pending writers doesn't need to drop to 0, making it so would end up implementing something like forced quiecent states of RCU ] To satisfy the above without undue complexity it serializes between CPUs, this means that a pending writer can only be the same cpu in a nested context, and since (under normal operation) a cpu always makes progress we're good -- if the head is only published when the bottom most writer completes. Now we don't need to disable IRQs in order to serialize between CPUs, disabling preemption ought to be sufficient, esp since we already deal with nesting due to NMIs. This avoids potentially expensive (and needless) local IRQ disable/enable ops. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1258373161.26714.254.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-16 13:27:45 +01:00
Peter Zijlstra	047106adcc	sched: Sched_rt_periodic_timer vs cpu hotplug Heiko reported a case where a timer interrupt managed to reference a root_domain structure that was already freed by a concurrent hot-un-plug operation. Solve this like the regular sched_domain stuff is also synchronized, by adding a synchronize_sched() stmt to the free path, this ensures that a root_domain stays present for any atomic section that could have observed it. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Gregory Haskins <ghaskins@novell.com> Cc: Siddha Suresh B <suresh.b.siddha@intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> LKML-Reference: <1258363873.26714.83.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-16 10:46:27 +01:00
Thomas Gleixner	dc186ad741	workqueue: Add debugobjects support Add debugobject support to track the life time of work_structs. While at it, remove duplicate definition of INIT_DELAYED_WORK_ON_STACK(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>	2009-11-16 01:09:48 +09:00
Tejun Heo	498657a478	sched, kvm: Fix race condition involving sched_in_preempt_notifers In finish_task_switch(), fire_sched_in_preempt_notifiers() is called after finish_lock_switch(). However, depending on architecture, preemption can be enabled after finish_lock_switch() which breaks the semantics of preempt notifiers. So move it before finish_arch_switch(). This also makes the in- notifiers symmetric to out- notifiers in terms of locking - now both are called under rq lock. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Avi Kivity <avi@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4AFD2801.7020900@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-15 09:59:54 +01:00
Ingo Molnar	0ffa798d94	Merge branches 'perf/powerpc' and 'perf/bench' into perf/core Merge reason: Both 'perf bench' and the pending PowerPC changes are now ready for the next merge window. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-15 09:51:24 +01:00
Ingo Molnar	39dc78b651	Merge commit 'v2.6.32-rc7' into perf/core Merge reason: pick up perf fixlets Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-15 09:50:41 +01:00
Paul E. McKenney	2f51f9884f	rcu: Eliminate __rcu_pending() false positives Now that there are both ->gpnum and ->completed fields in the rcu_node structure, __rcu_pending() should check rdp->gpnum and rdp->completed against rnp->gpnum and rdp->completed, respectively, instead of the prior comparison against the rcu_state fields rsp->gpnum and rsp->completed. Given the old comparison, __rcu_pending() could return 1, resulting in a needless raise_softirq(RCU_SOFTIRQ). This useless work would happen if RCU responded to a scheduling-clock interrupt after the rcu_state fields had been updated, but before the rcu_node fields had been updated. Changing the comparison from the rcu_state fields to the rcu_node fields prevents this useless work from happening. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12581706991966-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-14 10:31:42 +01:00
Paul E. McKenney	560d4bc0df	rcu: Further cleanups of use of lastcomp Now that a copy of the rsp->completed flag is available in all rcu_node structures, make full use of it. It is still legitimate to access rsp->completed while holding the root rcu_node structure's lock, however. Also, tighten up force_quiescent_state()'s checks for end of current grace period. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1258170699933-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-14 10:31:42 +01:00
Thomas Gleixner	a362c638bd	clocksource/events: Fix fallout of generic code changes powerpc grew a new warning due to the type change of clockevent->mult. The architectures which use parts of the generic time keeping infrastructure tripped over my wrong assumption that clocksource_register is only used when GENERIC_TIME=y. I should have looked and also I should have known better. These renitent Gaul villages are racking my nerves. Some serious deprecating is due. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-14 00:35:52 +01:00
Thomas Gleixner	8e13c7b772	locking: Reduce ifdefs in kernel/spinlock.c With the Kconfig based inline decisions we can remove extra ifdefs in kernel/spinlock.c by creating the complex lockbreak functions as inlines which are inserted into the non inlined lock functions. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20091109151428.548614772@linutronix.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2009-11-13 20:53:28 +01:00
Thomas Gleixner	6beb000923	locking: Make inlining decision Kconfig based commit `892a7c67` (locking: Allow arch-inlined spinlocks) implements the selection of which lock functions are inlined based on defines in arch/.../spinlock.h: #define __always_inline__LOCK_FUNCTION Despite of the name __always_inline__* the lock functions can be built out of line depending on config options. Also if the arch does not set some inline defines the generic code might set them; again depending on config options. This makes it unnecessary hard to figure out when and which lock functions are inlined. Aside of that it makes it way harder and messier for -rt to manipulate the lock functions. Convert the inlining decision to CONFIG switches. Each lock function is inlined depending on CONFIG_INLINE_. The configs implement the existing dependencies. The architecture code can select ARCH_INLINE_ to signal that it wants the corresponding lock function inlined. ARCH_INLINE_* is necessary as Kconfig ignores "depends on" restrictions when a config element is selected. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20091109151428.504477141@linutronix.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2009-11-13 20:53:28 +01:00
Jon Hunter	97813f2fe7	nohz: Allow 32-bit machines to sleep for more than 2.15 seconds In the dynamic tick code, "max_delta_ns" (member of the "clock_event_device" structure) represents the maximum sleep time that can occur between timer events in nanoseconds. The variable, "max_delta_ns", is defined as an unsigned long which is a 32-bit integer for 32-bit machines and a 64-bit integer for 64-bit machines (if -m64 option is used for gcc). The value of max_delta_ns is set by calling the function "clockevent_delta2ns()" which returns a maximum value of LONG_MAX. For a 32-bit machine LONG_MAX is equal to 0x7fffffff and in nanoseconds this equates to ~2.15 seconds. Hence, the maximum sleep time for a 32-bit machine is ~2.15 seconds, where as for a 64-bit machine it will be many years. This patch changes the type of max_delta_ns to be "u64" instead of "unsigned long" so that this variable is a 64-bit type for both 32-bit and 64-bit machines. It also changes the maximum value returned by clockevent_delta2ns() to KTIME_MAX. Hence this allows a 32-bit machine to sleep for longer than ~2.15 seconds. Please note that this patch also changes "min_delta_ns" to be "u64" too and although this is unnecessary, it makes the patch simpler as it avoids to fixup all callers of clockevent_delta2ns(). [ tglx: changed "unsigned long long" to u64 as we use this data type through out the time code ] Signed-off-by: Jon Hunter <jon-hunter@ti.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1250617512-23567-3-git-send-email-jon-hunter@ti.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-13 20:46:24 +01:00
Thomas Gleixner	27185016b8	nohz: Track last do_timer() cpu The previous patch which limits the sleep time to the maximum deferment time of the time keeping clocksource has some limitations on SMP machines: if all CPUs are idle then for all CPUs the maximum sleep time is limited. Solve this by keeping track of which cpu had the do_timer() duty assigned last and limit the sleep time only for this cpu. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Cc: Jon Hunter <jon-hunter@ti.com> Cc: John Stultz <johnstul@us.ibm.com>	2009-11-13 20:46:24 +01:00
Jon Hunter	98962465ed	nohz: Prevent clocksource wrapping during idle The dynamic tick allows the kernel to sleep for periods longer than a single tick, but it does not limit the sleep time currently. In the worst case the kernel could sleep longer than the wrap around time of the time keeping clock source which would result in losing track of time. Prevent this by limiting it to the safe maximum sleep time of the current time keeping clock source. The value is calculated when the clock source is registered. [ tglx: simplified the code a bit and massaged the commit msg ] Signed-off-by: Jon Hunter <jon-hunter@ti.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1250617512-23567-2-git-send-email-jon-hunter@ti.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-13 20:46:24 +01:00
Thomas Gleixner	529eaccd90	nohz: Type cast printk argument On some archs local_softirq_pending() has a data type of unsigned long on others its unsigned int. Type cast it to (unsigned int) in the printk to avoid the compiler warning. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission>	2009-11-13 20:46:24 +01:00
Thomas Gleixner	7d2f944a2b	clocksource: Provide a generic mult/shift factor calculation MIPS has two functions to calculcate the mult/shift factors for clock sources and clock events at run time. ARM needs such functions as well. Implement a function which calculates the mult/shift factors based on the frequencies to which and from which is converted. The function also has a parameter to specify the minimum conversion range in seconds. This range is guaranteed not to produce a 64bit overflow when a value is multiplied with the calculated mult factor. The larger the conversion range the less becomes the conversion accuracy. Provide two inline wrappers which handle clock events and clock sources. For clock events the "from" frequency is nano seconds per second which corresponds to 1GHz and "to" is the device frequency. For clock sources "from" is the device frequency and "to" is nano seconds per second. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mikael Pettersson <mikpe@it.uu.se> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Linus Walleij <linus.walleij@stericsson.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20091111134229.766673305@linutronix.de>	2009-11-13 20:46:23 +01:00
Thomas Gleixner	23af368e9a	clockevents: Use u32 for mult and shift factors The mult and shift factors of clock events differ in their data type from those of clock sources for no reason. u32 is sufficient for both. shift is always <= 32 and mult is limited to 2^32-1 to avoid 64bit multiplication overflows in the conversion. Preparatory patch for a generic mult/shift factor calculation function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mikael Pettersson <mikpe@it.uu.se> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Linus Walleij <linus.walleij@stericsson.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20091111134229.725664788@linutronix.de>	2009-11-13 20:46:23 +01:00
Frederic Weisbecker	67178767b9	tracing: Rename 'lockdep' event subsystem into 'lock' Lockdep events subsystem gathers various locking related events such as a request, release, contention or acquisition of a lock. The name of this event subsystem is a bit of a misnomer since these events are not quite related to lockdep but more generally to locking, ie: these events are not reporting lock dependencies or possible deadlock scenario but pure locking events. Hence this rename. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <1258103194-843-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-13 10:48:27 +01:00
Paul E. McKenney	8e9aa8f067	rcu: Simplify association of forced quiescent states with grace periods The force_quiescent_state() function also took a snapshot of the ->completed field, which was as obnoxious as it was in rcu_sched_qs() and friends. So snapshot ->gpnum-1. Also, since the dyntick_record_completed() and dyntick_recall_completed() functions are now simple assignments that are independent of CONFIG_NO_HZ, and since their names are now misleading, get rid of them. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12580941042308-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-13 10:18:36 +01:00
Paul E. McKenney	b32e9eb6ad	rcu: Accelerate callback processing on CPUs not detecting GP end An earlier fix for a race resulted in a situation where the CPUs other than the CPU that detected the end of the grace period would not process their callbacks until the next grace period started. This means that these other CPUs would unnecessarily demand that an extra grace period be started. This patch eliminates this extra grace period and speeds callback processing by propagating rsp->completed to the rcu_node structures in the case where the CPU detecting the end of the grace period sees no reason to start a new grace period. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1258094104417-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-13 10:18:36 +01:00
Peter Zijlstra	fe3bcfe1f6	sched: More generic WAKE_AFFINE vs select_idle_sibling() Instead of only considering SD_WAKE_AFFINE \| SD_PREFER_SIBLING domains also allow all SD_PREFER_SIBLING domains below a SD_WAKE_AFFINE domain to change the affinity target. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091112145610.909723612@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-13 10:09:59 +01:00
Peter Zijlstra	a50bde5130	sched: Cleanup select_task_rq_fair() Clean up the new affine to idle sibling bits while trying to grok them. Should not have any function differences. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091112145610.832503781@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-13 10:09:58 +01:00
Hidetoshi Seto	761b1d26df	sched: Fix granularity of task_u/stime() Originally task_s/utime() were designed to return clock_t but later changed to return cputime_t by following commit: commit `efe567fc82` Author: Christian Borntraeger <borntraeger@de.ibm.com> Date: Thu Aug 23 15:18:02 2007 +0200 It only changed the type of return value, but not the implementation. As the result the granularity of task_s/utime() is still that of clock_t, not that of cputime_t. So using task_s/utime() in __exit_signal() makes values accumulated to the signal struct to be rounded and coarse grained. This patch removes casts to clock_t in task_u/stime(), to keep granularity of cputime_t over the calculation. v2: Use div_u64() to avoid error "undefined reference to `__udivdi3`" on some 32bit systems. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: xiyou.wangcong@gmail.com Cc: Spencer Candland <spencer@bluehost.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> LKML-Reference: <4AFB9029.9000208@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-12 15:23:47 +01:00
Mike Galbraith	055a00865d	sched: Fix/add missing update_rq_clock() calls kthread_bind(), migrate_task() and sched_fork were missing updates, and try_to_wake_up() was updating after having already used the stale clock. Aside from preventing potential latency hits, there' a side benefit in that early boot printk time stamps become monotonic. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1258020464.6491.2.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <new-submission>	2009-11-12 12:28:29 +01:00
Eric W. Biederman	4739a9748e	sysctl: Remove the last of the generic binary sysctl support Now that all of the users stopped using ctl_name and strategy it is safe to remove the fields from struct ctl_table, and it is safe to remove the stub strategy routines as well. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-12 02:05:06 -08:00
Eric W. Biederman	56992309cc	sysctl kernel: Remove binary sysctl logic Now that sys_sysctl is a generic wrapper around /proc/sys .ctl_name and .strategy members of sysctl tables are dead code. Remove them. Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-12 02:04:55 -08:00
Eric W. Biederman	757010f026	sysctl binary: Reorder the tests to process wild card entries first. A malicious user could have passed in a ctl_name of 0 and triggered the well know ctl_name to procname mapping code, instead of the wild card matching code. This is a slight problem as wild card entries don't have procnames, and because in some alternate universe a network device might have ifindex 0. So test for and handle wild card entries first. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-12 01:42:31 -08:00
Eric W. Biederman	63395b6597	sysctl: sysctl_binary.c Fix compilation when !CONFIG_NET dev_get_by_index does not exist when the network stack is not compiled in, so only include the code to follow wild card paths when the network stack is present. I have shuffled the code around a little to make it clear that dev_put is called after dev_get_by_index showing that there is no leak. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-12 01:25:43 -08:00
Steven Rostedt	8b2a5dac78	tracing: do not disable interrupts for trace_clock_local Disabling interrupts in trace_clock_local takes quite a performance hit to the recording of traces. Using perf top we see: ------------------------------------------------------------------------------ PerfTop: 244 irqs/sec kernel:100.0% [1000Hz cpu-clock-msecs], (all, 4 CPUs) ------------------------------------------------------------------------------ samples pcnt kernel function _______ _____ _______________ 2842.00 - 40.4% : trace_clock_local 1043.00 - 14.8% : rb_reserve_next_event 784.00 - 11.1% : ring_buffer_lock_reserve 600.00 - 8.5% : __rb_reserve_next 579.00 - 8.2% : rb_end_commit 440.00 - 6.3% : ring_buffer_unlock_commit 290.00 - 4.1% : ring_buffer_producer_thread [ring_buffer_benchmark] 155.00 - 2.2% : debug_smp_processor_id 117.00 - 1.7% : trace_recursive_unlock 103.00 - 1.5% : ring_buffer_event_data 28.00 - 0.4% : do_gettimeofday 22.00 - 0.3% : _spin_unlock_irq 14.00 - 0.2% : native_read_tsc 11.00 - 0.2% : getnstimeofday Where trace_clock_local is 40% of the tracing, and the time for recording a trace according to ring_buffer_benchmark is 210ns. After converting the interrupts to preemption disabling we have from perf top: ------------------------------------------------------------------------------ PerfTop: 1084 irqs/sec kernel:99.9% [1000Hz cpu-clock-msecs], (all, 4 CPUs) ------------------------------------------------------------------------------ samples pcnt kernel function _______ _____ _______________ 1277.00 - 16.8% : native_read_tsc 1148.00 - 15.1% : rb_reserve_next_event 896.00 - 11.8% : ring_buffer_lock_reserve 688.00 - 9.1% : __rb_reserve_next 664.00 - 8.8% : rb_end_commit 563.00 - 7.4% : ring_buffer_unlock_commit 508.00 - 6.7% : _spin_unlock_irq 365.00 - 4.8% : debug_smp_processor_id 321.00 - 4.2% : trace_clock_local 303.00 - 4.0% : ring_buffer_producer_thread [ring_buffer_benchmark] 273.00 - 3.6% : native_sched_clock 122.00 - 1.6% : trace_recursive_unlock 113.00 - 1.5% : sched_clock 101.00 - 1.3% : ring_buffer_event_data 53.00 - 0.7% : tick_nohz_stop_sched_tick Where trace_clock_local drops from 40% to only taking 4% of the total time. The trace time also goes from 210ns down to 179ns (31ns). I talked with Peter Zijlstra about the impact that sched_clock may have without having interrupts disabled, and he told me that if a timer interrupt comes in, sched_clock may report a wrong time. Balancing a seldom incorrect timestamp with a 15% performance boost, I'll take the performance boost. Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-11 23:38:33 -05:00
Eric W. Biederman	2fb10732c3	sysctl: Warn about all uses of sys_sysctl. Now that the glibc pthread implemenation no longers uses sysctl() users of sysctl are as rare as hen's teeth. So remove the glibc exception from the warning, and use the standard printk_ratelimit instead of rolling our own. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-11 19:35:52 -08:00
Steven Rostedt	a6f0eb6adc	ring-buffer: Add multiple iterations between benchmark timestamps The ring_buffer_benchmark does a gettimeofday after every write to the ring buffer in its measurements. This adds the overhead of the call to gettimeofday to the measurements and does not give an accurate picture of the length of time it takes to record a trace. This was first noticed with perf top: ------------------------------------------------------------------------------ PerfTop: 679 irqs/sec kernel:99.9% [1000Hz cpu-clock-msecs], (all, 4 CPUs) ------------------------------------------------------------------------------ samples pcnt kernel function _______ _____ _______________ 1673.00 - 27.8% : trace_clock_local 806.00 - 13.4% : do_gettimeofday 590.00 - 9.8% : rb_reserve_next_event 554.00 - 9.2% : native_read_tsc 431.00 - 7.2% : ring_buffer_lock_reserve 365.00 - 6.1% : __rb_reserve_next 355.00 - 5.9% : rb_end_commit 322.00 - 5.4% : getnstimeofday 268.00 - 4.5% : ring_buffer_unlock_commit 262.00 - 4.4% : ring_buffer_producer_thread [ring_buffer_benchmark] 113.00 - 1.9% : read_tsc 91.00 - 1.5% : debug_smp_processor_id 69.00 - 1.1% : trace_recursive_unlock 66.00 - 1.1% : ring_buffer_event_data 25.00 - 0.4% : _spin_unlock_irq And the length of each write to the ring buffer measured at 310ns. This patch adds a new module parameter called "write_interval" which is defaulted to 50. This is the number of writes performed between timestamps. After this patch perf top shows: ------------------------------------------------------------------------------ PerfTop: 244 irqs/sec kernel:100.0% [1000Hz cpu-clock-msecs], (all, 4 CPUs) ------------------------------------------------------------------------------ samples pcnt kernel function _______ _____ _______________ 2842.00 - 40.4% : trace_clock_local 1043.00 - 14.8% : rb_reserve_next_event 784.00 - 11.1% : ring_buffer_lock_reserve 600.00 - 8.5% : __rb_reserve_next 579.00 - 8.2% : rb_end_commit 440.00 - 6.3% : ring_buffer_unlock_commit 290.00 - 4.1% : ring_buffer_producer_thread [ring_buffer_benchmark] 155.00 - 2.2% : debug_smp_processor_id 117.00 - 1.7% : trace_recursive_unlock 103.00 - 1.5% : ring_buffer_event_data 28.00 - 0.4% : do_gettimeofday 22.00 - 0.3% : _spin_unlock_irq 14.00 - 0.2% : native_read_tsc 11.00 - 0.2% : getnstimeofday do_gettimeofday dropped from 13% usage to a mere 0.4%! (using the default 50 interval) The measurement for each timestamp went from 310ns to 210ns. That's 100ns (1/3rd) overhead that the gettimeofday call was introducing. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-11 22:22:15 -05:00
David S. Miller	3586e0a9a4	clocksource/timecompare: Fix symbol exports to be GPL'd. Noticed by Thomas GLeixner. Signed-off-by: David S. Miller <davem@davemloft.net>	2009-11-11 19:06:30 -08:00
Roel Kluin	a646365cc3	tracing: Fix return value of tracing_stats_read() The function tracing_stats_read() mistakenly returns ENOMEM instead of the negative value -ENOMEM. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> LKML-Reference: <4AFB2C0B.50605@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-11 21:26:55 -05:00
Paul E. McKenney	0e0fc1c23e	rcu: Mark init-time-only rcu_bootup_announce() as __init Because rcu_bootup_announce() is used only at boot time, mark it as __init, presumably so that its memory can be reclaimed. Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20091111192806.GA10073@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-11 21:27:42 +01:00
Linus Torvalds	961767b75d	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: highmem: Fix debug_kmap_atomic() to also handle KM_IRQ_PTE, KM_NMI, and KM_NMI_PTE highmem: Fix race in debug_kmap_atomic() which could cause warn_count to underflow rcu: Fix long-grace-period race between forcing and initialization uids: Prevent tear down race	2009-11-11 11:30:15 -08:00
Linus Torvalds	1fd18a871a	Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: try_one_irq() must be called with irq disabled	2009-11-11 11:29:58 -08:00
Eric W. Biederman	2315ffa0a9	sysctl: Don't look at ctl_name and strategy in the generic code The ctl_name and strategy fields are unused, now that sys_sysctl is a compatibility wrapper around /proc/sys. No longer looking at them in the generic code is effectively what we are doing now and provides the guarantee that during further cleanups we can just remove references to those fields and everything will work ok. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-11 00:53:43 -08:00
Eric W. Biederman	6fce56ec91	sysctl: Remove references to ctl_name and strategy from the generic sysctl table Now that sys_sysctl is a generic wrapper around /proc/sys .ctl_name and .strategy members of sysctl tables are dead code. Remove them. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-11 00:42:56 -08:00
Eric W. Biederman	83ac201b4f	sysctl: Remove dead code from sysctl_check Now that the sys_sysctl is now a compatibility wrapper around /proc/sys we can remove much of sysctl_check and reduce it to a few remaining sanity checks. This completely decouples it from the binary sysctl system call. Little things like ensuring that the sysctl has not already been registered are all that remain. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-11 00:42:53 -08:00
Eric W. Biederman	a965cf946d	sysctl: Neuter the generic sysctl strategy routines. Now that sys_sysctl is a compatibility layer on top of /proc/sys these routines are never called but are still put in sysctl tables so I have reduced them to stubs until they can be removed entirely. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-11 00:42:51 -08:00
Eric W. Biederman	26a7034b40	sysctl: Reduce sys_sysctl to a compatibility wrapper around /proc/sys To simply maintenance and to be able to remove all of the binary sysctl support from various subsystems I have rewritten the binary sysctl code as a compatibility wrapper around proc/sys. The code is built around a hard coded table based on the table in sysctl_check.c that lists all of our current binary sysctls and provides enough information to convert from the sysctl binary input into into ascii and back again. New in this patch is the realization that the only dynamic entries that need to be handled have ifname as the asscii string and ifindex as their ctl_name. When a sys_sysctl is called the code now looks in the translation table converting the binary name to the path under /proc where the value is to be found. Opens that file, and calls into a format conversion wrapper that calls fop->read and then fop->write as appropriate. Since in practice the practically no one uses or tests sys_sysctl rewritting the code to be beautiful is a little silly. The redeeming merit of this work is it allows us to rip out all of the binary sysctl syscall support from everywhere else in the tree. Allowing us to remove a lot of dead (after this patch) and barely maintained code. In addition it becomes much easier to optimize the sysctl implementation for being the backing store of /proc/sys, without having to worry about sys_sysctl. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-11 00:42:05 -08:00
Paul E. McKenney	c64ac3ce06	rcu: Simplify association of quiescent states with grace periods The rdp->passed_quiesc_completed fields are used to properly associate the recorded quiescent state with a grace period. It is OK to wrongly associate a given quiescent state with a preceding grace period, but it is fatal to associate a given quiescent state with a grace period that begins after the quiescent state occurred. Grace periods are numbered, and the following fields track them: o ->gpnum is the number of the grace period currently in progress, or the number of the last grace period to complete if no grace period is currently in progress. o ->completed is the number of the last grace period to have completed. These two fields are equal if there is no grace period in progress, otherwise ->gpnum is one greater than ->completed. But the rdp->passed_quiesc_completed field compared against ->completed, and if equal, the quiescent state is presumed to count against the current grace period. The earlier code copied rdp->completed to rdp->passed_quiesc_completed, which has been made to work, but is error-prone. In contrast, copying one less than rdp->gpnum is guaranteed safe, because rdp->gpnum is not incremented until after the start of the corresponding grace period. At the end of the grace period, when ->completed has incremented, then any quiescent periods recorded previously will be discarded. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12578890421011-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 22:48:50 +01:00
Paul E. McKenney	4bcfe05503	rcu: Rename dynticks_completed to completed_fqs This field is used whether or not CONFIG_NO_HZ is set, so the old name of ->dynticks_completed is quite misleading. Change to ->completed_fqs, given that it the value that force_quiescent_state() is trying to drive the ->completed field away from. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12578890423298-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 22:48:50 +01:00
Paul E. McKenney	956539b759	rcu: Enable synchronize_sched_expedited() fastpath This patch adds a counter increment to enable tasks to actually take the synchronize_sched_expedited() function's fastpath. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1257889042435-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 22:48:49 +01:00
Paul E. McKenney	dbe01350fa	rcu: Remove inline from forward-referenced functions Some variants of gcc are reputed to dislike forward references to functions declared "inline". Remove the "inline" keyword from such functions. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12578890422402-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 22:48:49 +01:00
Peter Zijlstra	ffd44db5f0	sched: Make sure task has correct sched_class after policy change From the code in rt_mutex_setprio(), it is evident that the intention is that task's with a RT 'prio' value as a consequence of receiving a PI boost also have their 'sched_class' field set to '&rt_sched_class'. However, Peter noticed that the code in __setscheduler() could result in this intention being frustrated. Fix it. Reported-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <1257880321.4108.457.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 20:22:31 +01:00
Frederic Weisbecker	f60d24d2ad	hw-breakpoints: Fix broken hw-breakpoint sample module The hw-breakpoint sample module has been broken during the hw-breakpoint internals refactoring. Propagate the changes to it. Reported-by: "K. Prasad" <prasad@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-11-10 11:23:29 +01:00
Paul Mundt	676c0dbe6e	ksym_tracer: Support read accesses independent of read/write. All of the infrastructure already exists to support read accesses for platforms that support a read access independently of read/write (such as in the case of the SuperH UBC). This just trivially hooks up the read case by itself. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kiszka <jan.kiszka@web.de> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Avi Kivity <avi@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <20091109083733.GA25848@linux-sh.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-11-10 11:10:08 +01:00
Mike Galbraith	eae0c9dfb5	sched: Fix and clean up rate-limit newidle code Commit `1b9508f`, "Rate-limit newidle" has been confirmed to fix the netperf UDP loopback regression reported by Alex Shi. This is a cleanup and a fix: - moved to a more out of the way spot - fix to ensure that balancing doesn't try to balance runqueues which haven't gone online yet, which can mess up CPU enumeration during boot. Reported-by: Alex Shi <alex.shi@intel.com> Reported-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> # .32.x: `a1f84a3`: sched: Check for an idle shared cache Cc: <stable@kernel.org> # .32.x: `1b9508f`: sched: Rate-limit newidle Cc: <stable@kernel.org> # .32.x: `fd21073`: sched: Fix affinity logic Cc: <stable@kernel.org> # .32.x LKML-Reference: <1257821402.5648.17.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 04:25:58 +01:00
Paul E. McKenney	9160306e6f	rcu: Fix note_new_gpnum() uses of ->gpnum Impose a clear locking design on the note_new_gpnum() function's use of the ->gpnum counter. This is done by updating rdp->gpnum only from the corresponding leaf rcu_node structure's rnp->gpnum field, and even then only under the protection of that same rcu_node structure's ->lock field. Performance and scalability are maintained using a form of double-checked locking, and excessive spinning is avoided by use of the spin_trylock() function. The use of spin_trylock() is safe due to the fact that CPUs who fail to acquire this lock will try again later. The hierarchical nature of the rcu_node data structure limits contention (which could be limited further if need be using the RCU_FANOUT kernel parameter). Without this patch, obscure but quite possible races could result in a quiescent state that occurred during one grace period to be accounted to the following grace period, causing this following grace period to end prematurely. Not good! Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: <stable@kernel.org> # .32.x LKML-Reference: <12571987492350-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 04:12:11 +01:00
Paul E. McKenney	d09b62dfa3	rcu: Fix synchronization for rcu_process_gp_end() uses of ->completed counter Impose a clear locking design on the rcu_process_gp_end() function's use of the ->completed counter. This is done by creating a ->completed field in the rcu_node structure, which can safely be accessed under the protection of that structure's lock. Performance and scalability are maintained by using a form of double-checked locking, so that rcu_process_gp_end() only acquires the leaf rcu_node structure's ->lock if a grace period has recently ended. This fix reduces rcutorture failure rate by at least two orders of magnitude under heavy stress with force_quiescent_state() being invoked artificially often. Without this fix, unsynchronized access to the ->completed field can cause rcu_process_gp_end() to advance callbacks whose grace period has not yet expired. (Bad idea!) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: <stable@kernel.org> # .32.x LKML-Reference: <12571987494069-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 04:11:54 +01:00
Paul E. McKenney	281d150c5f	rcu: Prepare for synchronization fixes: clean up for non-NO_HZ handling of ->completed counter Impose a clear locking design on non-NO_HZ handling of the ->completed counter. This increases the distance between the RCU and the CPU-hotplug mechanisms. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: <stable@kernel.org> # .32.x LKML-Reference: <12571987491353-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 04:11:17 +01:00
Ingo Molnar	7e1a2766e6	Merge branch 'core/urgent' into core/rcu Merge reason: Pick up RCU fixlet to base further commits on. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 04:10:35 +01:00
Eric Paris	dd8dbf2e68	security: report the module name to security_module_request For SELinux to do better filtering in userspace we send the name of the module along with the AVC denial when a program is denied module_request. Example output: type=SYSCALL msg=audit(11/03/2009 10:59:43.510:9) : arch=x86_64 syscall=write success=yes exit=2 a0=3 a1=7fc28c0d56c0 a2=2 a3=7fffca0d7440 items=0 ppid=1727 pid=1729 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=rpc.nfsd exe=/usr/sbin/rpc.nfsd subj=system_u:system_r:nfsd_t:s0 key=(null) type=AVC msg=audit(11/03/2009 10:59:43.510:9) : avc: denied { module_request } for pid=1729 comm=rpc.nfsd kmod="net-pf-10" scontext=system_u:system_r:nfsd_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=system Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-11-10 09:33:46 +11:00
Naohiro Ooiwa	f84d49b218	signal: Print warning message when dropping signals When the system has too many timers or too many aggregate queued signals, the EAGAIN error is returned to application from kernel, including timer_create() [POSIX.1b]. It means that the app exceeded the limit of pending signals, but in general application writers do not expect this outcome and the current silent failure can cause rare app failures under very high load. This patch adds a new message when we reach the limit and if print_fatal_signals is enabled: task/1234: reached RLIMIT_SIGPENDING, dropping signal If you see this message and your system behaved unexpectedly, you can run following command to lift the limit: # ulimit -i unlimited With help from Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>. Signed-off-by: Naohiro Ooiwa <nooiwa@miraclelinux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Roland McGrath <roland@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: oleg@redhat.com LKML-Reference: <4AF6E7E2.9080406@miraclelinux.com> [ Modified a few small details, gave surrounding code some love. ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-09 09:44:26 +01:00
Uwe Kleine-K�nig	b71a8eb0fa	tree-wide: fix typos "selct" + "slect" -> "select" This patch was generated by git grep -E -i -l 's(le\|el)ct' \| xargs -r perl -p -i -e 's/([Ss])(le\|el)ct/$1elect/ with only skipping net/netfilter/xt_SECMARK.c and include/linux/netfilter/xt_SECMARK.h which have a struct member called selctx. Signed-off-by: Uwe Kleine-K�nig <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-11-09 09:40:56 +01:00
Li Zefan	30ff21e31f	ksym_tracer: Remove KSYM_SELFTEST_ENTRY The macro used to be used in both trace_selftest.c and trace_ksym.c, but no longer, so remove it from header file. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-11-08 16:21:01 +01:00
Frederic Weisbecker	ba1c813a6b	hw-breakpoints: Arbitrate access to pmu following registers constraints Allow or refuse to build a counter using the breakpoints pmu following given constraints. We keep track of the pmu users by using three per cpu variables: - nr_cpu_bp_pinned stores the number of pinned cpu breakpoints counters in the given cpu - nr_bp_flexible stores the number of non-pinned breakpoints counters in the given cpu. - task_bp_pinned stores the number of pinned task breakpoints in a cpu The latter is not a simple counter but gathers the number of tasks that have n pinned breakpoints. Considering HBP_NUM the number of available breakpoint address registers: task_bp_pinned[0] is the number of tasks having 1 breakpoint task_bp_pinned[1] is the number of tasks having 2 breakpoints [...] task_bp_pinned[HBP_NUM - 1] is the number of tasks having the maximum number of registers (HBP_NUM). When a breakpoint counter is created and wants an access to the pmu, we evaluate the following constraints: == Non-pinned counter == - If attached to a single cpu, check: (per_cpu(nr_bp_flexible, cpu) \|\| (per_cpu(nr_cpu_bp_pinned, cpu) + max(per_cpu(task_bp_pinned, cpu)))) < HBP_NUM -> If there are already non-pinned counters in this cpu, it means there is already a free slot for them. Otherwise, we check that the maximum number of per task breakpoints (for this cpu) plus the number of per cpu breakpoint (for this cpu) doesn't cover every registers. - If attached to every cpus, check: (per_cpu(nr_bp_flexible, ) \|\| (max(per_cpu(nr_cpu_bp_pinned, )) + max(per_cpu(task_bp_pinned, )))) < HBP_NUM -> This is roughly the same, except we check the number of per cpu bp for every cpu and we keep the max one. Same for the per tasks breakpoints. == Pinned counter == - If attached to a single cpu, check: ((per_cpu(nr_bp_flexible, cpu) > 1) + per_cpu(nr_cpu_bp_pinned, cpu) + max(per_cpu(task_bp_pinned, cpu))) < HBP_NUM -> Same checks as before. But now the nr_bp_flexible, if any, must keep one register at least (or flexible breakpoints will never be be fed). - If attached to every cpus, check: ((per_cpu(nr_bp_flexible, ) > 1) + max(per_cpu(nr_cpu_bp_pinned, )) + max(per_cpu(task_bp_pinned, ))) < HBP_NUM Changes in v2: - Counter -> event rename Changes in v5: - Fix unreleased non-pinned task-bound-only counters. We only released it in the first cpu. (Thanks to Paul Mackerras for reporting that) Changes in v6: - Currently, events scheduling are done in this order: cpu context pinned + cpu context non-pinned + task context pinned + task context non-pinned events. Then our current constraints are right theoretically but not in practice, because non-pinned counters may be scheduled before we can apply every possible pinned counters. So consider non-pinned counters as pinned for now. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jan Kiszka <jan.kiszka@web.de> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Avi Kivity <avi@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org>	2009-11-08 16:20:47 +01:00
Frederic Weisbecker	24f1e32c60	hw-breakpoints: Rewrite the hw-breakpoints layer on top of perf events This patch rebase the implementation of the breakpoints API on top of perf events instances. Each breakpoints are now perf events that handle the register scheduling, thread/cpu attachment, etc.. The new layering is now made as follows: ptrace kgdb ftrace perf syscall \ \| / / \ \| / / / Core breakpoint API / / \| / \| / Breakpoints perf events \| \| Breakpoints PMU ---- Debug Register constraints handling (Part of core breakpoint API) \| \| Hardware debug registers Reasons of this rewrite: - Use the centralized/optimized pmu registers scheduling, implying an easier arch integration - More powerful register handling: perf attributes (pinned/flexible events, exclusive/non-exclusive, tunable period, etc...) Impact: - New perf ABI: the hardware breakpoints counters - Ptrace breakpoints setting remains tricky and still needs some per thread breakpoints references. Todo (in the order): - Support breakpoints perf counter events for perf tools (ie: implement perf_bpcounter_event()) - Support from perf tools Changes in v2: - Follow the perf "event " rename - The ptrace regression have been fixed (ptrace breakpoint perf events weren't released when a task ended) - Drop the struct hw_breakpoint and store generic fields in perf_event_attr. - Separate core and arch specific headers, drop asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h - Use new generic len/type for breakpoint - Handle off case: when breakpoints api is not supported by an arch Changes in v3: - Fix broken CONFIG_KVM, we need to propagate the breakpoint api changes to kvm when we exit the guest and restore the bp registers to the host. Changes in v4: - Drop the hw_breakpoint_restore() stub as it is only used by KVM - EXPORT_SYMBOL_GPL hw_breakpoint_restore() as KVM can be built as a module - Restore the breakpoints unconditionally on kvm guest exit: TIF_DEBUG_THREAD doesn't anymore cover every cases of running breakpoints and vcpu->arch.switch_db_regs might not always be set when the guest used debug registers. (Waiting for a reliable optimization) Changes in v5: - Split-up the asm-generic/hw-breakpoint.h moving to linux/hw_breakpoint.h into a separate patch - Optimize the breakpoints restoring while switching from kvm guest to host. We only want to restore the state if we have active breakpoints to the host, otherwise we don't care about messed-up address registers. - Add asm/hw_breakpoint.h to Kbuild - Fix bad breakpoint type in trace_selftest.c Changes in v6: - Fix wrong header inclusion in trace.h (triggered a build error with CONFIG_FTRACE_SELFTEST Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jan Kiszka <jan.kiszka@web.de> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Avi Kivity <avi@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org>	2009-11-08 15:34:42 +01:00
Lai Jiangshan	d8c80ce091	sched, no_hz: Remove unused rq->last_tick_seen field In `15934a3732`, field last_tick_seen is added to struct rq. But it is unused now. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Guillaume Chazarain <guichaz@yahoo.fr> LKML-Reference: <4AE6A513.6010100@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-08 13:17:47 +01:00
Cyrill Gorcunov	e9036b36ee	sched: Use root_task_group_empty only with FAIR_GROUP_SCHED root_task_group_empty is used only with FAIR_GROUP_SCHED so if we use other scheduler options we get: kernel/sched.c:314: warning: 'root_task_group_empty' defined but not used So move CONFIG_FAIR_GROUP_SCHED up that it covers root_task_group_empty(). Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091026192414.GB5321@lenovo> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-08 13:15:48 +01:00
Cyrill Gorcunov	c82a43d40b	irq: Do not attempt to create subdirectories if /proc/irq/<irq> failed If a parent directory (ie /proc/irq/<irq>) could not be created we should not attempt to create subdirectories. Otherwise it would lead that "smp_affinity" and "spurious" entries are may be registered under /proc root instead of a proper place. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <20091026202811.GD5321@lenovo> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-08 13:14:22 +01:00
Randy Dunlap	968c86458a	sched: Fix kernel-doc function parameter name Fix variable name in sched.c kernel-doc notation. Fixes this DocBook warning: Warning(kernel/sched.c:2008): No description found for parameter 'p' Warning(kernel/sched.c:2008): Excess function parameter 'k' description in 'kthread_bind' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <4AF4B1BC.8020604@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-08 11:26:25 +01:00
Frederic Weisbecker	444a2a3bcd	tracing, perf_events: Protect the buffer from recursion in perf While tracing using events with perf, if one enables the lockdep:lock_acquire event, it will infect every other perf trace events. Basically, you can enable whatever set of trace events through perf but if this event is part of the set, the only result we can get is a long list of lock_acquire events of rcu read lock, and only that. This is because of a recursion inside perf. 1) When a trace event is triggered, it will fill a per cpu buffer and submit it to perf. 2) Perf will commit this event but will also protect some data using rcu_read_lock 3) A recursion appears: rcu_read_lock triggers a lock_acquire event that will fill the per cpu event and then submit the buffer to perf. 4) Perf detects a recursion and ignores it 5) Perf continues its work on the previous event, but its buffer has been overwritten by the lock_acquire event, it has then been turned into a lock_acquire event of rcu read lock Such scenario also happens with lock_release with rcu_read_unlock(). We could turn the rcu_read_lock() into __rcu_read_lock() to drop the lock debugging from perf fast path, but that would make us lose the rcu debugging and that doesn't prevent from other possible kind of recursion from perf in the future. This patch adds a recursion protection based on a counter on the perf trace per cpu buffers to solve the problem. -v2: Fixed lost whitespace, added reviewed-by tag Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <1257477185-7838-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-08 10:31:42 +01:00
Yong Zhang	e7e7e0c084	genirq: try_one_irq() must be called with irq disabled Prarit reported: ================================= [ INFO: inconsistent lock state ] 2.6.32-rc5 #1 --------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. swapper/0 [HC0[0]:SC1[1]:HE1:SE0] takes: (&irq_desc_lock_class){?.-...}, at: [<ffffffff810c264e>] try_one_irq+0x32/0x138 {IN-HARDIRQ-W} state was registered at: [<ffffffff81095160>] __lock_acquire+0x2fc/0xd5d [<ffffffff81095cb4>] lock_acquire+0xf3/0x12d [<ffffffff814cdadd>] _spin_lock+0x40/0x89 [<ffffffff810c3389>] handle_level_irq+0x30/0x105 [<ffffffff81014e0e>] handle_irq+0x95/0xb7 [<ffffffff810141bd>] do_IRQ+0x6a/0xe0 [<ffffffff81012813>] ret_from_intr+0x0/0x16 irq event stamp: 195096 hardirqs last enabled at (195096): [<ffffffff814cd7f7>] _spin_unlock_irq+0x3a/0x5c hardirqs last disabled at (195095): [<ffffffff814cdbdd>] _spin_lock_irq+0x29/0x95 softirqs last enabled at (195088): [<ffffffff81068c92>] __do_softirq+0x1c1/0x1ef softirqs last disabled at (195093): [<ffffffff8101304c>] call_softirq+0x1c/0x30 other info that might help us debug this: 1 lock held by swapper/0: #0: (kernel/irq/spurious.c:21){+.-...}, at: [<ffffffff81070cf2>] run_timer_softirq+0x1a9/0x315 stack backtrace: Pid: 0, comm: swapper Not tainted 2.6.32-rc5 #1 Call Trace: <IRQ> [<ffffffff81093e94>] valid_state+0x187/0x1ae [<ffffffff81093fe4>] mark_lock+0x129/0x253 [<ffffffff810951d4>] __lock_acquire+0x370/0xd5d [<ffffffff81095cb4>] lock_acquire+0xf3/0x12d [<ffffffff814cdadd>] _spin_lock+0x40/0x89 [<ffffffff810c264e>] try_one_irq+0x32/0x138 [<ffffffff810c2795>] poll_all_shared_irqs+0x41/0x6d [<ffffffff810c27dd>] poll_spurious_irqs+0x1c/0x49 [<ffffffff81070d82>] run_timer_softirq+0x239/0x315 [<ffffffff81068bd3>] __do_softirq+0x102/0x1ef [<ffffffff8101304c>] call_softirq+0x1c/0x30 [<ffffffff81014b65>] do_softirq+0x59/0xca [<ffffffff810686ad>] irq_exit+0x58/0xae [<ffffffff81029b84>] smp_apic_timer_interrupt+0x94/0xba [<ffffffff81012a33>] apic_timer_interrupt+0x13/0x20 The reason is that try_one_irq() is called from hardirq context with interrupts disabled and from softirq context (poll_all_shared_irqs()) with interrupts enabled. Disable interrupts before calling it from poll_all_shared_irqs(). Reported-and-tested-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> LKML-Reference: <1257563773-4620-1-git-send-email-yong.zhang0@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-07 21:44:45 +01:00
Eric W. Biederman	642c6d946b	sysctl: Make do_sysctl static Now that all of the architectures use compat_sys_sysctl do_sysctl can become static. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-06 03:53:59 -08:00
Eric W. Biederman	942405f360	sysctl: Remove the cond_syscall entry for sys32_sysctl Now that all architechtures are use compat_sys_sysctl and sys32_sysctl does not exist there is not point in retaining a cond_syscall entry for it. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-06 03:53:59 -08:00
Eric W. Biederman	da3f6f9b3e	sysctl: Introduce a generic compat sysctl sysctl This uses compat_alloc_userspace to remove the various hacks to allow do_sysctl to write to throuh oldlenp. The rest of our mature compat syscall helper facitilies are used as well to ensure we have a nice clean maintainable compat syscall that can be used on all architectures. The motiviation for a generic compat sysctl (besides the obvious hack removal) is to reduce the number of compat sysctl defintions out there so I can refactor the binary sysctl implementation. ppc already used the name compat_sys_sysctl so I remove the ppcs version here. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-06 03:52:55 -08:00
Eric W. Biederman	2830b68361	sysctl: Refactor the binary sysctl handling to remove duplicate code Read in the binary sysctl path once, instead of reread it from user space each time the code needs to access a path element. The deprecated sysctl warning is moved to do_sysctl so that the compat_sysctl entries syscalls will also warn. The return of -ENOSYS when !CONFIG_SYSCTL_SYSCALL is moved to binary_sysctl. Always leaving a do_sysctl available that handles !CONFIG_SYSCTL_SYSCALL and printing the deprecated sysctl warning allows for a single defitition of the sysctl syscall. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-06 03:23:35 -08:00
Eric W. Biederman	afa588b265	sysctl: Separate the binary sysctl logic into it's own file. In preparation for more invasive cleanups separate the core binary sysctl logic into it's own file. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-06 03:20:07 -08:00
Linus Torvalds	608221fdf9	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c sched: Disable SD_PREFER_LOCAL at node level sched: Fix boot crash by zalloc()ing most of the cpu masks sched: Strengthen buddies and mitigate buddy induced latencies	2009-11-05 10:56:47 -08:00
Linus Torvalds	72cc129e8d	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ftrace: Fix unmatched locking in ftrace_regex_write() ring-buffer: Synchronize resizing buffer with reader lock	2009-11-05 10:56:25 -08:00
Mike Galbraith	fd210738f6	sched: Fix affinity logic in select_task_rq_fair() Ingo Molnar reported: [ 26.804000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10 [ 26.808000] caller is vmstat_update+0x26/0x70 [ 26.812000] Pid: 10, comm: events/1 Not tainted 2.6.32-rc5 #6887 [ 26.816000] Call Trace: [ 26.820000] [<c1924a24>] ? printk+0x28/0x3c [ 26.824000] [<c13258a0>] debug_smp_processor_id+0xf0/0x110 [ 26.824000] mount used greatest stack depth: 1464 bytes left [ 26.828000] [<c111d086>] vmstat_update+0x26/0x70 [ 26.832000] [<c1086418>] worker_thread+0x188/0x310 [ 26.836000] [<c10863b7>] ? worker_thread+0x127/0x310 [ 26.840000] [<c108d310>] ? autoremove_wake_function+0x0/0x60 [ 26.844000] [<c1086290>] ? worker_thread+0x0/0x310 [ 26.848000] [<c108cf0c>] kthread+0x7c/0x90 [ 26.852000] [<c108ce90>] ? kthread+0x0/0x90 [ 26.856000] [<c100c0a7>] kernel_thread_helper+0x7/0x10 [ 26.860000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10 [ 26.864000] caller is vmstat_update+0x3c/0x70 Because this commit: `a1f84a3`: sched: Check for an idle shared cache in select_task_rq_fair() broke ->cpus_allowed. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: arjan@infradead.org Cc: <stable@kernel.org> LKML-Reference: <1257415066.12867.1.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-05 11:01:39 +01:00
Martin Schwidefsky	3c5d92a0cf	nohz: Introduce arch_needs_cpu Allow the architecture to request a normal jiffy tick when the system goes idle and tick_nohz_stop_sched_tick is called . On s390 the hook is used to prevent the system going fully idle if there has been an interrupt other than a clock comparator interrupt since the last wakeup. On s390 the HiperSockets response time for 1 connection ping-pong goes down from 42 to 34 microseconds. The CPU cost decreases by 27%. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> LKML-Reference: <20090929122533.402715150@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-05 07:53:53 +01:00
Martin Schwidefsky	eed3b9cf3f	nohz: Reuse ktime in sub-functions of tick_check_idle. On a system with NOHZ=y tick_check_idle calls tick_nohz_stop_idle and tick_nohz_update_jiffies. Given the right conditions (ts->idle_active and/or ts->tick_stopped) both function get a time stamp with ktime_get. The same time stamp can be reused if both function require one. On s390 this change has the additional benefit that gcc inlines the tick_nohz_stop_idle function into tick_check_idle. The number of instructions to execute tick_check_idle drops from 225 to 144 (without the ktime_get optimization it is 367 vs 215 instructions). before: 0) \| tick_check_idle() { 0) \| tick_nohz_stop_idle() { 0) \| ktime_get() { 0) \| read_tod_clock() { 0) 0.601 us \| } 0) 1.765 us \| } 0) 3.047 us \| } 0) \| ktime_get() { 0) \| read_tod_clock() { 0) 0.570 us \| } 0) 1.727 us \| } 0) \| tick_do_update_jiffies64() { 0) 0.609 us \| } 0) 8.055 us \| } after: 0) \| tick_check_idle() { 0) \| ktime_get() { 0) \| read_tod_clock() { 0) 0.617 us \| } 0) 1.773 us \| } 0) \| tick_do_update_jiffies64() { 0) 0.593 us \| } 0) 4.477 us \| } Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: john stultz <johnstul@us.ibm.com> LKML-Reference: <20090929122533.206589318@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-05 07:53:53 +01:00
Bjorn Helgaas	1e5ad96790	resources: when allocate_resource() fails, leave resource untouched When "allocate_resource(root, new, size, ...)" fails, we currently clobber "new". This is inconvenient for the caller, who might care about the original contents of the resource. For example, when pci_bus_alloc_resource() fails, the "can't allocate mem resource %pR" message from pci_assign_resources() currently contains junk for the resource start/end. This patch delays the "new" update until we're about to return success. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2009-11-04 13:06:46 -08:00
Mike Galbraith	1b9508f683	sched: Rate-limit newidle Rate limit newidle to migration_cost. It's a win for all stages of sysbench oltp tests. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 19:13:48 +01:00
Mike Galbraith	a1f84a3ab8	sched: Check for an idle shared cache in select_task_rq_fair() When waking affine, check for an idle shared cache, and if found, wake to that CPU/sibling instead of the waker's CPU. This improves pgsql+oltp ramp up by roughly 8%. Possibly more for other loads, depending on overlap. The trade-off is a roughly 1% peak downturn if tasks are truly synchronous. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@kernel.org> LKML-Reference: <1256654138.17752.7.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 18:46:22 +01:00
Thomas Gleixner	663e695928	irq: Remove unused debug_poll_all_shared_irqs() commit `74296a8ed` added this function for debug purposes, but it was never used for anything. Remove it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-04 14:22:21 +01:00
Liuweni	24b26d4211	irq: Fix docbook comments Fix docbook comments to match the actual function names (set_irq_msi, handle_percpu_irq). Signed-off-by: Liuwenyi <qingshenlwy@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-04 14:13:14 +01:00
Rusty Russell	acc3f5d7ca	cpumask: Partition_sched_domains takes array of cpumask_var_t Currently partition_sched_domains() takes a 'struct cpumask *doms_new' which is a kmalloc'ed array of cpumask_t. You can't have such an array if 'struct cpumask' is undefined, as we plan for CONFIG_CPUMASK_OFFSTACK=y. So, we make this an array of cpumask_var_t instead: this is the same for the CONFIG_CPUMASK_OFFSTACK=n case, but requires multiple allocations for the CONFIG_CPUMASK_OFFSTACK=y case. Hence we add alloc_sched_domains() and free_sched_domains() functions. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <200911031453.40668.rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 13:16:40 +01:00
Rusty Russell	e2c8806304	cpumask: Simplify sched_rt.c find_lowest_rq() wants to call pick_optimal_cpu() on the intersection of sched_domain_span(sd) and lowest_mask. Rather than doing a cpus_and into a temporary, we can open-code it. This actually makes the code slightly clearer, IMHO. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Gregory Haskins <ghaskins@novell.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <200911031453.15350.rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 13:16:38 +01:00
Masami Hiramatsu	77b44d1b7c	tracing/kprobes: Rename Kprobe-tracer to kprobe-event Rename Kprobes-based event tracer to kprobes-based tracing event (kprobe-event), since it is not a tracer but an extensible tracing event interface. This also changes CONFIG_KPROBE_TRACER to CONFIG_KPROBE_EVENT and sets it y by default. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> LKML-Reference: <20091104001247.3454.14131.stgit@harusame> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 13:02:48 +01:00
Ingo Molnar	a2e7127153	Merge commit 'v2.6.32-rc6' into perf/core Conflicts: tools/perf/Makefile Merge reason: Resolve the conflict, merge to upstream and merge in perf fixes so we can add a dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 11:59:45 +01:00
Hiroshi Shimamoto	9824a2b728	sched: Remove unused cpu_nr_migrations() cpu_nr_migrations() is not used, remove it. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4AF12A66.6020609@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 11:43:43 +01:00
Hiroshi Shimamoto	1477b6a7ed	sched: Remove unused __schedule() declaration __schedule() had been removed. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4AF129C8.3030008@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 11:43:42 +01:00
Li Zefan	ed146b2594	ftrace: Fix unmatched locking in ftrace_regex_write() When a command is passed to the set_ftrace_filter, then the ftrace_regex_lock is still held going back to user space. # echo 'do_open : foo' > set_ftrace_filter (still holding ftrace_regex_lock when returning to user space!) Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4AEF7F8A.3080300@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-04 01:42:10 -05:00
Lai Jiangshan	f7112949f6	ring-buffer: Synchronize resizing buffer with reader lock We got a sudden panic when we reduced the size of the ringbuffer. We can reproduce the panic by the following steps: echo 1 > events/sched/enable cat trace_pipe > /dev/null & while ((1)) do echo 12000 > buffer_size_kb echo 512 > buffer_size_kb done (not more than 5 seconds, panic ...) Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4AF01735.9060409@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-04 00:04:20 -05:00
Frederic Weisbecker	97eaf5300b	perf/core: Add a callback to perf events A simple callback in a perf event can be used for multiple purposes. For example it is useful for triggered based events like hardware breakpoints that need a callback to dispatch a triggered breakpoint event. v2: Simplify a bit the callback attribution as suggested by Paul Mackerras Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "K.Prasad" <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mundt <lethal@linux-sh.org>	2009-11-03 19:11:53 +01:00
Arjan van de Ven	fb0459d75c	perf/core: Provide a kernel-internal interface to get to performance counters There are reasons for kernel code to ask for, and use, performance counters. For example, in CPU freq governors this tends to be a good idea, but there are other examples possible as well of course. This patch adds the needed bits to do enable this functionality; they have been tested in an experimental cpufreq driver that I'm working on, and the changes are all that I needed to access counters properly. [fweisbec@gmail.com: added pid to perf_event_create_kernel_counter so that we can profile a particular task too TODO: Have a better error reporting, don't just return NULL in fail case.] v2: Remove the wrong comment about the fact perf_event_create_kernel_counter must be called from a kernel thread. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: "K.Prasad" <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Avi Kivity <avi@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jan Kiszka <jan.kiszka@web.de> Cc: Avi Kivity <avi@redhat.com> LKML-Reference: <20090925122556.2f8bd939@infradead.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-11-03 18:04:17 +01:00
Linus Torvalds	38dc63459f	Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: Remove some debug messages producing too much noise PM: Fix warning on suspend errors PM / Hibernate: Add newline to load_image() fail path PM / Hibernate: Fix error handling in save_image() PM / Hibernate: Fix blkdev refleaks PM / yenta: Split resume into early and late parts (rev. 4)	2009-11-03 07:52:57 -08:00
Ian Campbell	1d51075094	Correct nr_processes() when CPUs have been unplugged nr_processes() returns the sum of the per cpu counter process_counts for all online CPUs. This counter is incremented for the current CPU on fork() and decremented for the current CPU on exit(). Since a process does not necessarily fork and exit on the same CPU the process_count for an individual CPU can be either positive or negative and effectively has no meaning in isolation. Therefore calculating the sum of process_counts over only the online CPUs omits the processes which were started or stopped on any CPU which has since been unplugged. Only the sum of process_counts across all possible CPUs has meaning. The only caller of nr_processes() is proc_root_getattr() which calculates the number of links to /proc as stat->nlink = proc_root.nlink + nr_processes(); You don't have to be all that unlucky for the nr_processes() to return a negative value leading to a negative number of links (or rather, an apparently enormous number of links). If this happens then you can get failures where things like "ls /proc" start to fail because they got an -EOVERFLOW from some stat() call. Example with some debugging inserted to show what goes on: # ps haux\|wc -l nr_processes: CPU0: 90 nr_processes: CPU1: 1030 nr_processes: CPU2: -900 nr_processes: CPU3: -136 nr_processes: TOTAL: 84 proc_root_getattr. nlink 12 + nr_processes() 84 = 96 84 # echo 0 >/sys/devices/system/cpu/cpu1/online # ps haux\|wc -l nr_processes: CPU0: 85 nr_processes: CPU2: -901 nr_processes: CPU3: -137 nr_processes: TOTAL: -953 proc_root_getattr. nlink 12 + nr_processes() -953 = -941 75 # stat /proc/ nr_processes: CPU0: 84 nr_processes: CPU2: -901 nr_processes: CPU3: -137 nr_processes: TOTAL: -954 proc_root_getattr. nlink 12 + nr_processes() -954 = -942 File: `/proc/' Size: 0 Blocks: 0 IO Block: 1024 directory Device: 3h/3d Inode: 1 Links: 4294966354 Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2009-11-03 09:06:55.000000000 +0000 Modify: 2009-11-03 09:06:55.000000000 +0000 Change: 2009-11-03 09:06:55.000000000 +0000 I'm not 100% convinced that the per_cpu regions remain valid for offline CPUs, although my testing suggests that they do. If not then I think the correct solution would be to aggregate the process_count for a given CPU into a global base value in cpu_down(). This bug appears to pre-date the transition to git and it looks like it may even have been present in linux-2.6.0-test7-bk3 since it looks like the code Rusty patched in http://lwn.net/Articles/64773/ was already wrong. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-11-03 07:52:39 -08:00
Jiri Slaby	bf9fd67a03	PM / Hibernate: Add newline to load_image() fail path Finish a line by \n when load_image fails in the middle of loading. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-11-03 11:03:09 +01:00
Jiri Slaby	4ff277f9e4	PM / Hibernate: Fix error handling in save_image() There are too many retval variables in save_image(). Thus error return value from snapshot_read_next() may be ignored and only part of the snapshot (successfully) written. Remove 'error' variable, invert the condition in the do-while loop and convert the loop to use only 'ret' variable. Switch the rest of the function to consider only 'ret'. Also make sure we end printed line by \n if an error occurs. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-11-03 11:02:43 +01:00
Jiri Slaby	76b57e613f	PM / Hibernate: Fix blkdev refleaks While cruising through the swsusp code I found few blkdev reference leaks of resume_bdev. swsusp_read: remove blkdev_put altogether. Some fail paths do not do that. swsusp_check: make sure we always put a reference on fail paths software_resume: all fail paths between swsusp_check and swsusp_read omit swsusp_close. Add it in those cases. And since swsusp_read doesn't drop the reference anymore, do it here unconditionally. [rjw: Fixed a small coding style issue.] Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-11-03 11:01:46 +01:00
Mike Galbraith	b84ff7d6f1	sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c Eric Paris reported that commit `f685ceacab` causes boot time PREEMPT_DEBUG complaints. [ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314 [ 4.593043] caller is task_hot+0x86/0xd0 Since kthread_bind() messes with scheduler internals, move the body to sched.c, and lock the runqueue. Reported-by: Eric Paris <eparis@redhat.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Tested-by: Eric Paris <eparis@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1256813310.7574.3.camel@marge.simson.net> [ v2: fix !SMP build and clean up ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-03 07:25:00 +01:00
Linus Torvalds	3fe866ca6c	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Fix spurious wakeup for requeue_pi really	2009-11-02 09:46:33 -08:00
Linus Torvalds	bce8fc4cb7	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tools: Remove -Wcast-align perf tools: Fix compatibility with libelf 0.8 and autodetect perf events: Don't generate events for the idle task when exclude_idle is set perf events: Fix swevent hrtimer sampling by keeping track of remaining time when enabling/disabling swevent hrtimers	2009-11-02 09:46:06 -08:00
Linus Torvalds	a5e3013d66	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Remove cpu arg from the rb_time_stamp() function tracing: Fix comment typo and documentation example tracing: Fix trace_seq_printf() return value tracing: Update *ppos instead of filp->f_pos	2009-11-02 09:45:44 -08:00
Ananth N Mavinakayanahalli	4dae560f97	kprobes: Sanitize struct kretprobe_instance allocations For as long as kretprobes have existed, we've allocated NR_CPUS instances of kretprobe_instance structures. With the default value of CONFIG_NR_CPUS increasing on certain architectures, we are potentially wasting kernel memory. See http://sourceware.org/bugzilla/show_bug.cgi?id=10839#c3 for more details. Use a saner num_possible_cpus() instead of NR_CPUS for allocation. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: fweisbec@gmail.com LKML-Reference: <20091030135310.GA22230@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-02 17:00:18 +01:00
Lai Jiangshan	c5e0cb3ddc	rcu: Cleanup: balance rcu_irq_enter()/rcu_irq_exit() calls Currently, rcu_irq_exit() is invoked only for CONFIG_NO_HZ, while rcu_irq_enter() is invoked unconditionally. This patch moves rcu_irq_exit() out from under CONFIG_NO_HZ so that the calls are balanced. This patch has no effect on the behavior of the kernel because both rcu_irq_enter() and rcu_irq_exit() are empty for !CONFIG_NO_HZ, but the code is easier to understand if the calls are obviously balanced in all cases. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12567428891605-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-02 16:06:37 +01:00
Paul E. McKenney	83f5b01ffb	rcu: Fix long-grace-period race between forcing and initialization Very long RCU read-side critical sections (50 milliseconds or so) can cause a race between force_quiescent_state() and rcu_start_gp() as follows on kernel builds with multi-level rcu_node hierarchies: 1. CPU 0 calls force_quiescent_state(), sees that there is a grace period in progress, and acquires ->fsqlock. 2. CPU 1 detects the end of the grace period, and so cpu_quiet_msk_finish() sets rsp->completed to rsp->gpnum. This operation is carried out under the root rnp->lock, but CPU 0 has not yet acquired that lock. Note that rsp->signaled is still RCU_SAVE_DYNTICK from the last grace period. 3. CPU 1 calls rcu_start_gp(), but no one wants a new grace period, so it drops the root rnp->lock and returns. 4. CPU 0 acquires the root rnp->lock and picks up rsp->completed and rsp->signaled, then drops rnp->lock. It then enters the RCU_SAVE_DYNTICK leg of the switch statement. 5. CPU 2 invokes call_rcu(), and now needs a new grace period. It calls rcu_start_gp(), which acquires the root rnp->lock, sets rsp->signaled to RCU_GP_INIT (too bad that CPU 0 is already in the RCU_SAVE_DYNTICK leg of the switch statement!) and starts initializing the rcu_node hierarchy. If there are multiple levels to the hierarchy, it will drop the root rnp->lock and initialize the lower levels of the hierarchy. 6. CPU 0 notes that rsp->completed has not changed, which permits both CPU 2 and CPU 0 to try updating it concurrently. If CPU 0's update prevails, later calls to force_quiescent_state() can count old quiescent states against the new grace period, which can in turn result in premature ending of grace periods. Not good. This patch adds an RCU_GP_IDLE state for rsp->signaled that is set initially at boot time and any time a grace period ends. This prevents CPU 0 from getting into the workings of force_quiescent_state() in step 4. Additional locking and checks prevent the concurrent update of rsp->signaled in step 6. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1256742889199-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-02 16:06:21 +01:00
Thomas Gleixner	b00bc0b237	uids: Prevent tear down race Ingo triggered the following warning: WARNING: at lib/debugobjects.c:255 debug_print_object+0x42/0x50() Hardware name: System Product Name ODEBUG: init active object type: timer_list Modules linked in: Pid: 2619, comm: dmesg Tainted: G W 2.6.32-rc5-tip+ #5298 Call Trace: [<81035443>] warn_slowpath_common+0x6a/0x81 [<8120e483>] ? debug_print_object+0x42/0x50 [<81035498>] warn_slowpath_fmt+0x29/0x2c [<8120e483>] debug_print_object+0x42/0x50 [<8120ec2a>] __debug_object_init+0x279/0x2d7 [<8120ecb3>] debug_object_init+0x13/0x18 [<810409d2>] init_timer_key+0x17/0x6f [<81041526>] free_uid+0x50/0x6c [<8104ed2d>] put_cred_rcu+0x61/0x72 [<81067fac>] rcu_do_batch+0x70/0x121 debugobjects warns about an enqueued timer being initialized. If CONFIG_USER_SCHED=y the user management code uses delayed work to remove the user from the hash table and tear down the sysfs objects. free_uid is called from RCU and initializes/schedules delayed work if the usage count of the user_struct is 0. The init/schedule happens outside of the uidhash_lock protected region which allows a concurrent caller of find_user() to reference the about to be destroyed user_struct w/o preventing the work from being scheduled. If the next free_uid call happens before the work timer expired then the active timer is initialized and the work scheduled again. The race was introduced in commit `5cb350ba` (sched: group scheduling, sysfs tunables) and made more prominent by commit `3959214f` (sched: delayed cleanup of user_struct) Move the init/schedule_delayed_work inside of the uidhash_lock protected region to prevent the race. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@us.ibm.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: stable@kernel.org	2009-11-02 16:02:39 +01:00
Rusty Russell	49557e6203	sched: Fix boot crash by zalloc()ing most of the cpu masks I got a boot crash when forcing cpumasks offstack on 32 bit, because find_new_ilb() returned 3 on my UP system (nohz.cpu_mask wasn't zeroed). AFAICT the others need to be zeroed too: only nohz.ilb_grp_nohz_mask is initialized before use. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <200911022037.21282.rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-02 15:48:54 +01:00
Li Zefan	5e9b397292	tracing: Fix to use __always_unused attribute ____ftrace_check_##name() is used for compile-time check on F_printk() only, so it should be marked as __unused instead of __used. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <4AEE2D01.4010305@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-02 15:47:54 +01:00
Stephen Rothwell	3c912b6eda	x86: Fix user return notifier put_cpu_var() invocation Today's linux-next build (x86_64 allmodconfig) failed like this: kernel/user-return-notifier.c: In function 'fire_user_return_notifiers': kernel/user-return-notifier.c:45: error: expected expression before ')' token Introduced by commit `7c68af6e32` ("core, x86: Add user return notifiers") from the tip and kvm trees but revealed by commit `e0fdb0e050` ("percpu: add __percpu for sparse") from the percpu tree. Before that percpu tree commit, "put_cpu_var()" would compile without error (even though it really needs a parameter). Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Avi Kivity <avi@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> LKML-Reference: <20091102161722.eea4358d.sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-02 07:58:59 +01:00
Linus Torvalds	8633322c5f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: sched: move rq_weight data array out of .percpu percpu: allow pcpu_alloc() to be called with IRQs off	2009-10-29 09:19:29 -07:00
Linus Torvalds	9532faeb29	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-param-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-param-fixes: param: fix setting arrays of bool param: fix NULL comparison on oom param: fix lots of bugs with writing charp params from sysfs, by leaking mem.	2009-10-29 09:18:20 -07:00
Linus Torvalds	3242f9804b	Merge branch 'hwpoison-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison-2.6.32' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: HWPOISON: fix invalid page count in printk output HWPOISON: Allow schedule_on_each_cpu() from keventd HWPOISON: fix/proc/meminfo alignment HWPOISON: fix oops on ksm pages HWPOISON: Fix page count leak in hwpoison late kill in do_swap_page HWPOISON: return early on non-LRU pages HWPOISON: Add brief hwpoison description to Documentation HWPOISON: Clean up PR_MCE_KILL interface	2009-10-29 08:20:00 -07:00
Linus Torvalds	fefcfd431b	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Move drop_futex_key_refs out of spinlock'ed region rcu: Fix TREE_PREEMPT_RCU CPU_HOTPLUG bad-luck hang rcu: Stopgap fix for synchronize_rcu_expedited() for TREE_PREEMPT_RCU rcu: Prevent RCU IPI storms in presence of high call_rcu() load futex: Check for NULL keys in match_futex futex: Handle spurious wake up	2009-10-29 08:12:20 -07:00
Linus Torvalds	37c2ca2411	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf timechart: Improve the visual appearance of scheduler delays perf timechart: Fix the wakeup-arrows that point to non-visible processes perf top: Fix --delay_secs 0 division by zero perf tools: Bump version to 0.0.2 perf_event: Adjust frequency and unthrottle for non-group-leader events	2009-10-29 08:12:00 -07:00
Linus Torvalds	6e958d73c2	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Do less agressive buddy clearing sched: Disable SD_PREFER_LOCAL for MC/CPU domains	2009-10-29 08:10:38 -07:00
Alexey Dobriyan	8c85dd8730	sysctl: fix false positives when PROC_SYSCTL=n Having ->procname but not ->proc_handler is valid when PROC_SYSCTL=n, people use such combination to reduce ifdefs with non-standard handlers. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14408 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: Peter Teoh <htmldeveloper@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-29 07:39:30 -07:00
KOSAKI Motohiro	478988d3b2	cgroup: fix strstrip() misuse cgroup_write_X64() and cgroup_write_string() ignore the return value of strstrip(). it makes small inconsistent behavior. example: ========================= # cd /mnt/cgroup/hoge # cat memory.swappiness 60 # echo "59 " > memory.swappiness # cat memory.swappiness 59 # echo " 58" > memory.swappiness bash: echo: write error: Invalid argument This patch fixes it. Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-29 07:39:25 -07:00
Christian Borntraeger	0d0df599f1	connector: fix regression introduced by sid connector Since commit `02b51df1b0` (proc connector: add event for process becoming session leader) we have the following warning: Badness at kernel/softirq.c:143 [...] Krnl PSW : 0404c00180000000 00000000001481d4 (local_bh_enable+0xb0/0xe0) [...] Call Trace: ([<000000013fe04100>] 0x13fe04100) [<000000000048a946>] sk_filter+0x9a/0xd0 [<000000000049d938>] netlink_broadcast+0x2c0/0x53c [<00000000003ba9ae>] cn_netlink_send+0x272/0x2b0 [<00000000003baef0>] proc_sid_connector+0xc4/0xd4 [<0000000000142604>] __set_special_pids+0x58/0x90 [<0000000000159938>] sys_setsid+0xb4/0xd8 [<00000000001187fe>] sysc_noemu+0x10/0x16 [<00000041616cb266>] 0x41616cb266 The warning is ---> WARN_ON_ONCE(in_irq() \|\| irqs_disabled()); The network code must not be called with disabled interrupts but sys_setsid holds the tasklist_lock with spinlock_irq while calling the connector. After a discussion we agreed that we can move proc_sid_connector from __set_special_pids to sys_setsid. We also agreed that it is sufficient to change the check from task_session(curr) != pid into err > 0, since if we don't change the session, this means we were already the leader and return -EPERM. One last thing: There is also daemonize(), and some people might want to get a notification in that case. Since daemonize() is only needed if a user space does kernel_thread this does not look important (and there seems to be no consensus if this connector should be called in daemonize). If we really want this, we can add proc_sid_connector to daemonize() in an additional patch (Scott?) Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Scott James Remnant <scott@ubuntu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: David S. Miller <davem@davemloft.net> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-29 07:39:25 -07:00
Rusty Russell	dd17c8f729	percpu: remove per_cpu__ prefix. Now that the return from alloc_percpu is compatible with the address of per-cpu vars, it makes sense to hand around the address of per-cpu variables. To make this sane, we remove the per_cpu__ prefix we used created to stop people accidentally using these vars directly. Now we have sparse, we can use that (next patch). tj: * Updated to convert stuff which were missed by or added after the original patch. * Kill per_cpu_var() macro. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>	2009-10-29 22:34:15 +09:00
Tejun Heo	9705f69ed0	percpu: make percpu symbols in tracer unique This patch updates percpu related symbols in kernel tracer such that percpu symbols are unique and don't clash with local symbols. This serves two purposes of decreasing the possibility of global percpu symbol collision and allowing dropping per_cpu__ prefix from percpu symbols. * kernel/trace/trace.c: s/max_data/max_tr_data/ * kernel/trace/trace_hw_branches: s/tracer/hwb_tracer/, s/buffer/hwb_buffer/ Partly based on Rusty Russell's "alloc_percpu: rename percpu vars which cause name clashes" patch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com>	2009-10-29 22:34:13 +09:00
Tejun Heo	1871e52c76	percpu: make percpu symbols under kernel/ and mm/ unique This patch updates percpu related symbols under kernel/ and mm/ such that percpu symbols are unique and don't clash with local symbols. This serves two purposes of decreasing the possibility of global percpu symbol collision and allowing dropping per_cpu__ prefix from percpu symbols. * kernel/lockdep.c: s/lock_stats/cpu_lock_stats/ * kernel/sched.c: s/init_rq_rt/init_rt_rq_var/ (any better idea?) s/sched_group_cpus/sched_groups/ * kernel/softirq.c: s/ksoftirqd/run_ksoftirqd/a * kernel/softlockup.c: s/()_timestamp/softlockup_\1_ts/ s/watchdog_task/softlockup_watchdog/ s/timestamp/ts/ for local variables kernel/time/timer_stats: s/lookup_lock/tstats_lookup_lock/ * mm/slab.c: s/reap_work/slab_reap_work/ s/reap_node/slab_reap_node/ * mm/vmstat.c: local variable changed to avoid collision with vmstat_work Partly based on Rusty Russell's "alloc_percpu: rename percpu vars which cause name clashes" patch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: (slab/vmstat) Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de>	2009-10-29 22:34:13 +09:00
Ingo Molnar	9de09ace8d	Merge branch 'tracing/urgent' into tracing/core Merge reason: Pick up fixes and move base from -rc1 to -rc5. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-29 09:02:20 +01:00
Li Zefan	3ed67776fc	tracing/filters: Fix to make system filter work commit `fce29d15b5` ("tracing/filters: Refactor subsystem filter code") broke system filter accidentally. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <4AE810BD.3070009@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-29 08:53:20 +01:00
Masami Hiramatsu	dd004c475c	kprobe-tracer: Compare both of event-name and event-group to find probe Fix find_probe_event() to compare both of event-name and event-group. Without this fix, kprobe-tracer overwrites existing same event-name probe even if its group-name is different. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> LKML-Reference: <20091027204244.30545.27516.stgit@harusame> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-29 08:47:47 +01:00
Rusty Russell	3c7d76e371	param: fix setting arrays of bool We create a dummy struct kernel_param on the stack for parsing each array element, but we didn't initialize the flags word. This matters for arrays of type "bool", where the flag indicates if it really is an array of bools or unsigned int (old-style). Reported-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org	2009-10-29 08:56:20 +10:30
Rusty Russell	d553ad864e	param: fix NULL comparison on oom kp->arg is always true: it's the contents of that pointer we care about. Reported-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org	2009-10-29 08:56:18 +10:30
Rusty Russell	65afac7d80	param: fix lots of bugs with writing charp params from sysfs, by leaking mem. `e180a6b775` "param: fix charp parameters set via sysfs" fixed the case where charp parameters written via sysfs were freed, leaving drivers accessing random memory. Unfortunately, storing a flag in the kparam struct was a bad idea: it's rodata so setting it causes an oops on some archs. But that's not all: 1) module_param_array() on charp doesn't work reliably, since we use an uninitialized temporary struct kernel_param. 2) there's a fundamental race if a module uses this parameter and then it's changed: they will still access the old, freed, memory. The simplest fix (ie. for 2.6.32) is to never free the memory. This prevents all these problems, at cost of a memory leak. In practice, there are only 18 places where a charp is writable via sysfs, and all are root-only writable. Reported-by: Takashi Iwai <tiwai@suse.de> Cc: Sitsofe Wheeler <sitsofe@yahoo.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christof Schmitt <christof.schmitt@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org	2009-10-29 08:56:17 +10:30
Tejun Heo	be404f0212	PM / freezer: Don't get over-anxious while waiting Freezing isn't exactly the most latency sensitive operation and there's no reason to burn cpu cycles and power waiting for it to complete. msleep(10) instead of yield(). This should improve reliability of emergency hibernation. [rjw: Modified the comment next to the msleep(10).] Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-10-28 22:53:09 +01:00
Thomas Gleixner	11df6dddcb	futex: Fix spurious wakeup for requeue_pi really The requeue_pi path doesn't use unqueue_me() (and the racy lock_ptr == NULL test) nor does it use the wake_list of futex_wake() which where the reason for commit 41890f2 (futex: Handle spurious wake up) See debugging discussing on LKML Message-ID: <4AD4080C.20703@us.ibm.com> The changes in this fix to the wait_requeue_pi path were considered to be a likely unecessary, but harmless safety net. But it turns out that due to the fact that for unknown $@#!*( reasons EWOULDBLOCK is defined as EAGAIN we built an endless loop in the code path which returns correctly EWOULDBLOCK. Spurious wakeups in wait_requeue_pi code path are unlikely so we do the easy solution and return EWOULDBLOCK^WEAGAIN to user space and let it deal with the spurious wakeup. Cc: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> LKML-Reference: <4AE23C74.1090502@us.ibm.com> Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-10-28 20:34:34 +01:00
Catalin Marinas	a6f5aa1ea0	kmemleak: Scan the _ftrace_events section in modules This section contains pointers to allocated objects and not scanning it leads to false positives. Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>	2009-10-28 17:07:54 +00:00
Jiri Kosina	4a6cc4bd32	sched: move rq_weight data array out of .percpu Commit `34d76c41` introduced percpu array update_shares_data, size of which being proportional to NR_CPUS. Unfortunately this blows up ia64 for large NR_CPUS configuration, as ia64 allows only 64k for .percpu section. Fix this by allocating this array dynamically and keep only pointer to it percpu. The per-cpu handling doesn't impose significant performance penalty on potentially contented path in tg_shares_up(). ... ffffffff8104337c: 65 48 8b 14 25 20 cd mov %gs:0xcd20,%rdx ffffffff81043383: 00 00 ffffffff81043385: 48 c7 c0 00 e1 00 00 mov $0xe100,%rax ffffffff8104338c: 48 c7 45 a0 00 00 00 movq $0x0,-0x60(%rbp) ffffffff81043393: 00 ffffffff81043394: 48 c7 45 a8 00 00 00 movq $0x0,-0x58(%rbp) ffffffff8104339b: 00 ffffffff8104339c: 48 01 d0 add %rdx,%rax ffffffff8104339f: 49 8d 94 24 08 01 00 lea 0x108(%r12),%rdx ffffffff810433a6: 00 ffffffff810433a7: b9 ff ff ff ff mov $0xffffffff,%ecx ffffffff810433ac: 48 89 45 b0 mov %rax,-0x50(%rbp) ffffffff810433b0: bb 00 04 00 00 mov $0x400,%ebx ffffffff810433b5: 48 89 55 c0 mov %rdx,-0x40(%rbp) ... After: ... ffffffff8104337c: 65 8b 04 25 28 cd 00 mov %gs:0xcd28,%eax ffffffff81043383: 00 ffffffff81043384: 48 98 cltq ffffffff81043386: 49 8d bc 24 08 01 00 lea 0x108(%r12),%rdi ffffffff8104338d: 00 ffffffff8104338e: 48 8b 15 d3 7f 76 00 mov 0x767fd3(%rip),%rdx # ffffffff817ab368 <update_shares_data> ffffffff81043395: 48 8b 34 c5 00 ee 6d mov -0x7e921200(,%rax,8),%rsi ffffffff8104339c: 81 ffffffff8104339d: 48 c7 45 a0 00 00 00 movq $0x0,-0x60(%rbp) ffffffff810433a4: 00 ffffffff810433a5: b9 ff ff ff ff mov $0xffffffff,%ecx ffffffff810433aa: 48 89 7d c0 mov %rdi,-0x40(%rbp) ffffffff810433ae: 48 c7 45 a8 00 00 00 movq $0x0,-0x58(%rbp) ffffffff810433b5: 00 ffffffff810433b6: bb 00 04 00 00 mov $0x400,%ebx ffffffff810433bb: 48 01 f2 add %rsi,%rdx ffffffff810433be: 48 89 55 b0 mov %rdx,-0x50(%rbp) ... Signed-off-by: Jiri Kosina <jkosina@suse.cz> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Tejun Heo <tj@kernel.org>	2009-10-29 00:26:00 +09:00
Catalin Marinas	c017b4be3e	kmemleak: Simplify the kmemleak_scan_area() function prototype This function was taking non-necessary arguments which can be determined by kmemleak. The patch also modifies the calling sites. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au>	2009-10-28 15:11:00 +00:00
Anton Blanchard	f7d7986060	perf_event: Add alignment-faults and emulation-faults software events Add two more software events that are common to many cpus. Alignment faults: When a load or store is not aligned properly. Emulation faults: When an instruction is emulated in software. Both cause a very significant slowdown (100x or worse), so identifying and fixing them is very important. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2009-10-28 16:13:03 +11:00
Peter Zijlstra	88b91c7ca4	rcu: Simplify creating of lockdep class for root rcu_node Use lockdep_set_class() to simplify the code and to avoid any additional overhead in the !LOCKDEP case. Also move the definition of rcu_root_class into kernel/rcutree.c, as suggested by Lai Jiangshan. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1256577871443-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-26 21:07:16 +01:00
Ingo Molnar	4ce5b90340	rcu: Do tiny cleanups in rcutiny No change in functionality - just straighten out a few small stylistic details. Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: avi@redhat.com Cc: mtosatti@redhat.com LKML-Reference: <12565226351355-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-26 09:40:40 +01:00
Paul E. McKenney	cf886c44ec	rcu: Improve rcutorture diagnostics when bad torture_type specified Make rcutorture list the available torture_type values when it doesn't like the one specified. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: avi@redhat.com Cc: mtosatti@redhat.com LKML-Reference: <12565226351868-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-26 09:40:31 +01:00
Paul E. McKenney	804bb83705	rcu: Add synchronize_srcu_expedited() to the rcutorture test suite Adds the "srcu_expedited" torture type, and also renames sched_ops_sync to sched_sync_ops for consistency while we are in this file. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: avi@redhat.com Cc: mtosatti@redhat.com LKML-Reference: <12565226353636-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-26 09:40:30 +01:00
Paul E. McKenney	0cd397d336	rcu: Add synchronize_srcu_expedited() This patch creates a synchronize_srcu_expedited() that uses synchronize_sched_expedited() where synchronize_srcu() uses synchronize_sched(). The synchronize_srcu() and synchronize_srcu_expedited() functions become one-liners that pass synchronize_sched() or synchronize_sched_expedited(), repectively, to a new __synchronize_srcu() function. While in the file, move the EXPORT_SYMBOL_GPL()s to immediately follow the corresponding functions. Requested-by: Avi Kivity <avi@redhat.com> Tested-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: avi@redhat.com LKML-Reference: <12565226354038-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-26 09:40:30 +01:00
Paul E. McKenney	9b1d82fa16	rcu: "Tiny RCU", The Bloatwatch Edition This patch is a version of RCU designed for !SMP provided for a small-footprint RCU implementation. In particular, the implementation of synchronize_rcu() is extremely lightweight and high performance. It passes rcutorture testing in each of the four relevant configurations (combinations of NO_HZ and PREEMPT) on x86. This saves about 1K bytes compared to old Classic RCU (which is no longer in mainline), and more than three kilobytes compared to Hierarchical RCU (updated to 2.6.30): CONFIG_TREE_RCU: text data bss dec filename 183 4 0 187 kernel/rcupdate.o 2783 520 36 3339 kernel/rcutree.o 3526 Total (vs 4565 for v7) CONFIG_TREE_PREEMPT_RCU: text data bss dec filename 263 4 0 267 kernel/rcupdate.o 4594 776 52 5422 kernel/rcutree.o 5689 Total (6155 for v7) CONFIG_TINY_RCU: text data bss dec filename 96 4 0 100 kernel/rcupdate.o 734 24 0 758 kernel/rcutiny.o 858 Total (vs 848 for v7) The above is for x86. Your mileage may vary on other platforms. Further compression is possible, but is being procrastinated. Changes from v7 (http://lkml.org/lkml/2009/10/9/388) o Apply Lai Jiangshan's review comments (aside from might_sleep() in synchronize_sched(), which is covered by SMP builds). o Fix up expedited primitives. Changes from v6 (http://lkml.org/lkml/2009/9/23/293). o Forward ported to put it into the 2.6.33 stream. o Added lockdep support. o Make lightweight rcu_barrier. Changes from v5 (http://lkml.org/lkml/2009/6/23/12). o Ported to latest pre-2.6.32 merge window kernel. - Renamed rcu_qsctr_inc() to rcu_sched_qs(). - Renamed rcu_bh_qsctr_inc() to rcu_bh_qs(). - Provided trivial rcu_cpu_notify(). - Provided trivial exit_rcu(). - Provided trivial rcu_needs_cpu(). - Fixed up the rcu_*_enter/exit() functions in linux/hardirq.h. o Removed the dependence on EMBEDDED, with a view to making TINY_RCU default for !SMP at some time in the future. o Added (trivial) support for expedited grace periods. Changes from v4 (http://lkml.org/lkml/2009/5/2/91) include: o Squeeze the size down a bit further by removing the ->completed field from struct rcu_ctrlblk. o This permits synchronize_rcu() to become the empty function. Previous concerns about rcutorture were unfounded, as rcutorture correctly handles a constant value from rcu_batches_completed() and rcu_batches_completed_bh(). Changes from v3 (http://lkml.org/lkml/2009/3/29/221) include: o Changed rcu_batches_completed(), rcu_batches_completed_bh() rcu_enter_nohz(), rcu_exit_nohz(), rcu_nmi_enter(), and rcu_nmi_exit(), to be static inlines, as suggested by David Howells. Doing this saves about 100 bytes from rcutiny.o. (The numbers between v3 and this v4 of the patch are not directly comparable, since they are against different versions of Linux.) Changes from v2 (http://lkml.org/lkml/2009/2/3/333) include: o Fix whitespace issues. o Change short-circuit "\|\|" operator to instead be "+" in order to fix performance bug noted by "kraai" on LWN. (http://lwn.net/Articles/324348/) Changes from v1 (http://lkml.org/lkml/2009/1/13/440) include: o This version depends on EMBEDDED as well as !SMP, as suggested by Ingo. o Updated rcu_needs_cpu() to unconditionally return zero, permitting the CPU to enter dynticks-idle mode at any time. This works because callbacks can be invoked upon entry to dynticks-idle mode. o Paul is now OK with this being included, based on a poll at the Kernel Miniconf at linux.conf.au, where about ten people said that they cared about saving 900 bytes on single-CPU systems. o Applies to both mainline and tip/core/rcu. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: avi@redhat.com Cc: mtosatti@redhat.com LKML-Reference: <12565226351355-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-26 09:40:29 +01:00
Avi Kivity	7a04109751	x86: Fix user return notifier build When CONFIG_USER_RETURN_NOTIFIER is set, we need to link kernel/user-return-notifier.o. Signed-off-by: Avi Kivity <avi@redhat.com> LKML-Reference: <1256473485-23109-1-git-send-email-avi@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-25 17:37:57 +01:00
Ryota Ozaki	ce0e7b28fb	sched, cpuacct: Fix niced guest time accounting CPU time of a guest is always accounted in 'user' time without concern for the nice value of its counterpart process although the guest is scheduled under the nice value. This patch fixes the defect and accounts cpu time of a niced guest in 'nice' time as same as a niced process. And also the patch adds 'guest_nice' to cpuacct. The value provides niced guest cpu time which is like 'nice' to 'user'. The original discussions can be found here: http://www.mail-archive.com/kvm@vger.kernel.org/msg23982.html http://www.mail-archive.com/kvm@vger.kernel.org/msg23860.html Signed-off-by: Ryota Ozaki <ozaki.ryota@gmail.com> Acked-by: Avi Kivity <avi@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1256314810-7897-1-git-send-email-ozaki.ryota@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-25 17:31:30 +01:00
Ingo Molnar	0b9e31e926	Merge branch 'linus' into sched/core Conflicts: fs/proc/array.c Merge reason: resolve conflict and queue up dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-25 17:30:53 +01:00
Jiri Olsa	6d3f1e12f4	tracing: Remove cpu arg from the rb_time_stamp() function The cpu argument is not used inside the rb_time_stamp() function. Plus fix a typo. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091023233647.118547500@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-24 11:07:51 +02:00
Jiri Olsa	67b394f7f2	tracing: Fix comment typo and documentation example Trivial patch to fix a documentation example and to fix a comment. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091023233646.871719877@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-24 11:07:50 +02:00
Jiri Olsa	3e69533b51	tracing: Fix trace_seq_printf() return value trace_seq_printf() return value is a little ambiguous. It currently returns the length of the space available in the buffer. printf usually returns the amount written. This is not adequate here, because: trace_seq_printf(s, ""); is perfectly legal, and returning 0 would indicate that it failed. We can always see the amount written by looking at the before and after values of s->len. This is not quite the same use as printf. We only care if the string was successfully written to the buffer or not. Make trace_seq_printf() return 0 if the trace oversizes the buffer's free space, 1 otherwise. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091023233646.631787612@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-24 11:07:50 +02:00
Jiri Olsa	cf8517cf90	tracing: Update ppos instead of filp->f_pos Instead of directly updating filp->f_pos we should update the ppos argument. The filp->f_pos gets updated within the file_pos_write() function called from sys_write(). Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091023233646.399670810@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-24 11:07:49 +02:00
Mike Galbraith	f685ceacab	sched: Strengthen buddies and mitigate buddy induced latencies This patch restores the effectiveness of LAST_BUDDY in preventing pgsql+oltp from collapsing due to wakeup preemption. It also switches LAST_BUDDY to exclusively do what it does best, namely mitigate the effects of aggressive wakeup preemption, which improves vmark throughput markedly, and restores mysql+oltp scalability. Since buddies are about scalability, enable them beginning at the point where we begin expanding sched_latency, namely sched_nr_latency. Previously, buddies were cleared aggressively, which seriously reduced their effectiveness. Not clearing aggressively however, produces a small drop in mysql+oltp throughput immediately after peak, indicating that LAST_BUDDY is actually doing some harm. This is right at the point where X on the desktop in competition with another load wants low latency service. Ergo, do not enable until we need to scale. To mitigate latency induced by buddies, or by a task just missing wakeup preemption, check latency at tick time. Last hunk prevents buddies from stymieing BALANCE_NEWIDLE via CACHE_HOT_BUDDY. Supporting performance tests: tip = v2.6.32-rc5-1497-ga525b32 tipx = NO_GENTLE_FAIR_SLEEPERS NEXT_BUDDY granularity knobs = 31 knobs + 31 buddies tip+x = NO_GENTLE_FAIR_SLEEPERS granularity knobs = 31 knobs (Three run averages except where noted.) vmark: ------ tip 108466 messages per second tip+ 125307 messages per second tip+x 125335 messages per second tipx 117781 messages per second 2.6.31.3 122729 messages per second mysql+oltp: ----------- clients 1 2 4 8 16 32 64 128 256 .......................................................................................... tip 9949.89 18690.20 34801.24 34460.04 32682.88 30765.97 28305.27 25059.64 19548.08 tip+ 10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47 tipx 9698.71 18002.70 34477.56 33420.01 32634.30 31657.27 29932.67 26827.52 21487.18 2.6.31.3 8243.11 18784.20 34404.83 33148.38 31900.32 31161.90 29663.81 25995.94 18058.86 pgsql+oltp: ----------- clients 1 2 4 8 16 32 64 128 256 .......................................................................................... tip 13686.37 26609.25 51934.28 51347.81 49479.51 45312.65 36691.91 26851.57 24145.35 tip+ (1x) 13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94 tip+x 13906.78 27065.81 52951.19 52542.59 52176.11 51815.94 50838.90 49439.46 46891.00 tipx 13742.46 26769.81 52351.99 51891.73 51320.79 50938.98 50248.65 48908.70 46553.84 2.6.31.3 13815.35 26906.46 52683.34 52061.31 51937.10 51376.80 50474.28 49394.47 47003.25 Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-23 23:48:28 +02:00
Christian Borntraeger	5c82871335	ratelimit: Make suppressed output messages more useful Today I got: [39648.224782] Registered led device: iwl-phy0::TX [40676.545099] __ratelimit: 246 callbacks suppressed [40676.545103] abcdef[23675]: segfault at 0 ... as you can see the ratelimit message contains a function prefix. Since this is always __ratelimit, this wont help much. This patch changes __ratelimit and printk_ratelimit to print the function name that calls ratelimit. This will pinpoint the responsible function, as long as not several different places call ratelimit with the same ratelimit state at the same time. In that case we catch only one random function that calls ratelimit after the wait period. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Young <hidave.darkstar@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> CC: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <200910231458.11832.borntraeger@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-23 17:26:37 +02:00
Sheng Yang	72f279b256	generic-ipi: Fix misleading smp_call_function*() description After commit:8969a5ede0f9e17da4b943712429aef2c9bcd82b "generic-ipi: remove kmalloc()", wait = 0 can be guaranteed. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <npiggin@suse.de> LKML-Reference: <1256210374-25354-1-git-send-email-sheng@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-23 13:51:45 +02:00
Soeren Sandmann	54f4407608	perf events: Don't generate events for the idle task when exclude_idle is set Getting samples for the idle task is often not interesting, so don't generate them when exclude_idle is set for the event in question. Signed-off-by: Søren Sandmann Pedersen <sandmann@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <ye8pr8fmlq7.fsf@camel16.daimi.au.dk> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-23 09:35:02 +02:00
Soeren Sandmann	721a669b72	perf events: Fix swevent hrtimer sampling by keeping track of remaining time when enabling/disabling swevent hrtimers Make the hrtimer based events work for sysprof. Whenever a swevent is scheduled out, the hrtimer is canceled. When it is scheduled back in, the timer is restarted. This happens every scheduler tick, which means the timer never expired because it was getting repeatedly restarted over and over with the same period. To fix that, save the remaining time when disabling; when reenabling, use that saved time as the period instead of the user-specified sampling period. Also, move the starting and stopping of the hrtimers to helper functions instead of duplicating the code. Signed-off-by: Søren Sandmann Pedersen <sandmann@redhat.com> LKML-Reference: <ye8vdi7mluz.fsf@camel16.daimi.au.dk> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-23 09:35:02 +02:00
Ingo Molnar	4331595650	Merge branch 'perf/core' into perf/probes Conflicts: tools/perf/Makefile Merge reason: - fix the conflict - pick up the pr_*() infrastructure to queue up dependent patch Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-23 08:23:20 +02:00
Rafael J. Wysocki	04bf7539c0	PM: Make warning in suspend_test_finish() less likely to happen Increase TEST_SUSPEND_SECONDS to 10 so the warning in suspend_test_finish() doesn't annoy the users of slower systems so much. Also, make the warning print the suspend-resume cycle time, so that we know why the warning actually triggered. Patch prepared during the hacking session at the Kernel Summit in Tokyo. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-22 08:23:45 +09:00
Ingo Molnar	c258449bc9	Merge branch 'perf/urgent' into perf/core Merge reason: Queue up dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-20 07:51:44 +02:00
Andi Kleen	65a6446434	HWPOISON: Allow schedule_on_each_cpu() from keventd Right now when calling schedule_on_each_cpu() from keventd there is a deadlock because it tries to schedule a work item on the current CPU too. This happens via lru_add_drain_all() in hwpoison. Just call the function for the current CPU in this case. This is actually faster too. Debugging with Fengguang Wu & Max Asbock Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-10-19 07:29:22 +02:00
Frederic Weisbecker	0f8f86c7bd	Merge commit 'perf/core' into perf/hw-breakpoint Conflicts: kernel/Makefile kernel/trace/Makefile kernel/trace/trace.h samples/Makefile Merge reason: We need to be uptodate with the perf events development branch because we plan to rewrite the breakpoints API on top of perf events.	2009-10-18 01:12:33 +02:00
Ingo Molnar	bb3c3e8071	Merge commit 'v2.6.32-rc5' into perf/probes Conflicts: kernel/trace/trace_event_profile.c Merge reason: update to -rc5 and resolve conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-17 09:58:25 +02:00
Masami Hiramatsu	e63cc2397e	tracing/kprobes: Add failure messages for debugging Add verbose failure messages to kprobe-tracer for debugging. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20091017000728.16556.16713.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-17 09:53:58 +02:00
Masami Hiramatsu	f397af06e4	tracing/kprobes: Update kprobe-tracer selftest against new syntax Update kprobe-tracer selftest since command syntax has been changed. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20091017000720.16556.26343.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-17 09:53:57 +02:00
Darren Hart	89061d3d58	futex: Move drop_futex_key_refs out of spinlock'ed region When requeuing tasks from one futex to another, the reference held by the requeued task to the original futex location needs to be dropped eventually. Dropping the reference may ultimately lead to a call to "iput_final" and subsequently call into filesystem- specific code - which may be non-atomic. It is therefore safer to defer this drop operation until after the futex_hash_bucket spinlock has been dropped. Originally-From: Helge Bahmann <hcb@chaoticmind.net> Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: <stable@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com> Cc: Sven-Thorsten Dietrich <sdietrich@novell.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <4AD7A298.5040802@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-16 10:19:18 +02:00
Linus Torvalds	bd0704111e	Merge the right tty-fixes branch * branch 'tty-fixes' tty: use the new 'flush_delayed_work()' helper to do ldisc flush workqueue: add 'flush_delayed_work()' to run and wait for delayed work tty: Make flush_to_ldisc() locking more robust	2009-10-15 14:59:24 -07:00
Paul E. McKenney	237c80c5c8	rcu: Fix TREE_PREEMPT_RCU CPU_HOTPLUG bad-luck hang If the following sequence of events occurs, then TREE_PREEMPT_RCU will hang waiting for a grace period to complete, eventually OOMing the system: o A TREE_PREEMPT_RCU build of the kernel is booted on a system with more than 64 physical CPUs present (32 on a 32-bit system). Alternatively, a TREE_PREEMPT_RCU build of the kernel is booted with RCU_FANOUT set to a sufficiently small value that the physical CPUs populate two or more leaf rcu_node structures. o A task is preempted in an RCU read-side critical section while running on a CPU corresponding to a given leaf rcu_node structure. o All CPUs corresponding to this same leaf rcu_node structure record quiescent states for the current grace period. o All of these same CPUs go offline (hence the need for enough physical CPUs to populate more than one leaf rcu_node structure). This causes the preempted task to be moved to the root rcu_node structure. At this point, there is nothing left to cause the quiescent state to be propagated up the rcu_node tree, so the current grace period never completes. The simplest fix, especially after considering the deadlock possibilities, is to detect this situation when the last CPU is offlined, and to set that CPU's ->qsmask bit in its leaf rcu_node structure. This will cause the next invocation of force_quiescent_state() to end the grace period. Without this fix, this hang can be triggered in an hour or so on some machines with rcutorture and random CPU onlining/offlining. With this fix, these same machines pass a full 10 hours of this sort of abuse. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20091015162614.GA19131@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-15 20:33:01 +02:00
Ingo Molnar	a66abe7fbf	tracing/events: Fix locking imbalance in the filter code Américo Wang noticed that we have a locking imbalance in the error paths of ftrace_profile_set_filter(), causing potential leakage of event_mutex. Also clean up other error codepaths related to event_mutex while at it. Plus fix an initialized variable in the subsystem filter code. Reported-by: Américo Wang <xiyou.wangcong@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <2375c9f90910150247u5ccb8e2at58c764e385ffa490@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-15 12:41:56 +02:00
Li Zefan	6fb2915df7	tracing/profile: Add filter support - Add an ioctl to allocate a filter for a perf event. - Free the filter when the associated perf event is to be freed. - Do the filtering in perf_swevent_match(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <4AD69546.8050401@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-15 11:35:23 +02:00
Li Zefan	b0f1a59a98	tracing/filters: Use a different op for glob match "==" will always do a full match, and "~" will do a glob match. In the future, we may add "=~" for regex match. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <4AD69528.3050309@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-15 11:35:22 +02:00
Li Zefan	fce29d15b5	tracing/filters: Refactor subsystem filter code Change: for_each_pred for_each_subsystem To: for_each_subsystem for_each_pred This change also prepares for later patches. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <4AD69502.8060903@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-15 11:35:22 +02:00
Ingo Molnar	713490e02e	Merge branch 'tracing/core' into perf/core Merge reason: to add event filter support we need the following commits from the tracing tree: `3f6fe06`: tracing/filters: Unify the regex parsing helpers `1889d20`: tracing/filters: Provide basic regex support `737f453`: tracing/filters: Cleanup useless headers Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-15 11:34:00 +02:00
Paul E. McKenney	3397e040df	rcu: Add rnp->blocked_tasks to tracing Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: npiggin@suse.de Cc: jens.axboe@oracle.com Cc: Josh Triplett <josh@joshtriplett.org> LKML-Reference: <20091014233638.GE6763@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> kernel/rcutree_trace.c \| 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)	2009-10-15 11:20:22 +02:00
Paul E. McKenney	019129d595	rcu: Stopgap fix for synchronize_rcu_expedited() for TREE_PREEMPT_RCU For the short term, map synchronize_rcu_expedited() to synchronize_rcu() for TREE_PREEMPT_RCU and to synchronize_sched_expedited() for TREE_RCU. Longer term, there needs to be a real expedited grace period for TREE_PREEMPT_RCU, but candidate patches to date are considerably more complex and intrusive. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: npiggin@suse.de Cc: jens.axboe@oracle.com LKML-Reference: <12555405592331-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-15 11:17:17 +02:00
Paul E. McKenney	37c72e56f6	rcu: Prevent RCU IPI storms in presence of high call_rcu() load As the number of callbacks on a given CPU rises, invoke force_quiescent_state() only every blimit number of callbacks (defaults to 10,000), and even then only if no other CPU has invoked force_quiescent_state() in the meantime. This should fix the performance regression reported by Nick. Reported-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: jens.axboe@oracle.com LKML-Reference: <12555405592133-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-15 11:17:16 +02:00
Ingo Molnar	b226f744d4	Merge branch 'linus' into perf/core Merge reason: pick up tools/perf/ changes from upstream. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-15 08:44:44 +02:00
Linus Torvalds	d6047d79b9	Merge branch 'tty-fixes' * branch 'tty-fixes': tty: use the new 'flush_delayed_work()' helper to do ldisc flush workqueue: add 'flush_delayed_work()' to run and wait for delayed work Make flush_to_ldisc properly handle parallel calls	2009-10-14 15:34:55 -07:00
Linus Torvalds	ee67e6cbe1	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: oprofile: warn on freeing event buffer too early oprofile: fix race condition in event_buffer free lockdep: Use cpu_clock() for lockstat	2009-10-14 15:25:35 -07:00
Linus Torvalds	f061d83a2b	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix missing kernel-doc notation Revert "x86, timers: Check for pending timers after (device) interrupts" sched: Update the clock of runqueue select_task_rq() selected	2009-10-14 15:25:04 -07:00
Linus Torvalds	e345fe1ada	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing/filters: Fix memory leak when setting a filter tracing: fix trace_vprintk call	2009-10-14 15:24:51 -07:00
Linus Torvalds	8c53e46314	workqueue: add 'flush_delayed_work()' to run and wait for delayed work It basically turns a delayed work into an immediate work, and then waits for it to finish, thus allowing you to force (and wait for) an immediate flush of a delayed work. We'll want to use this in the tty layer to clean up tty_flush_to_ldisc(). Acked-by: Oleg Nesterov <oleg@redhat.com> [ Fixed to use 'del_timer_sync()' as noted by Oleg ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-14 15:11:35 -07:00
Darren Hart	2bc872036e	futex: Check for NULL keys in match_futex If userspace tries to perform a requeue_pi on a non-requeue_pi waiter, it will find the futex_q->requeue_pi_key to be NULL and OOPS. Check for NULL in match_futex() instead of doing explicit NULL pointer checks on all call sites. While match_futex(NULL, NULL) returning false is a little odd, it's still correct as we expect valid key references. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Dinakar Guniguntala <dino@in.ibm.com> CC: John Stultz <johnstul@us.ibm.com> Cc: stable@kernel.org LKML-Reference: <4AD60687.10306@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-10-14 22:00:14 +02:00
Frederic Weisbecker	1beee96bae	ftrace: Rename set_bootup_ftrace into set_cmdline_ftrace set_cmdline_ftrace is a better match against what does this function: apply a tracer name from the kernel command line. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com>	2009-10-14 20:55:55 +02:00
Frederic Weisbecker	06f43d66ec	ftrace: Copy ftrace_graph_filter boot param using strlcpy We are using strncpy in the wrong way to copy the ftrace_graph_filter boot param because we pass the buffer size instead of the max string size it can contain (buffer size - 1). The end result might not be NULL terminated as we are abusing the max string size. Lets use strlcpy() instead. Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org>	2009-10-14 20:43:39 +02:00
Linus Torvalds	43046b6066	workqueue: add 'flush_delayed_work()' to run and wait for delayed work It basically turns a delayed work into an immediate work, and then waits for it to finish.	2009-10-14 09:16:42 -07:00
Thomas Gleixner	6f15fa5008	sys: Remove BKL from sys_reboot Serialization of sys_reboot can be done local. The BKL is not protecting anything else. LKML-Reference: <20091010153349.405590702@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-10-14 15:31:10 +02:00
Jonathan Corbet	1a6deaea35	pm_qos: clean up racy global "name" variable "name" is a poor name for a file-global variable. It was used in three different functions, with no mutual exclusion. But it's just a tiny, temporary string; let's just move it onto the stack in the functions that need it. Also use snprintf() just in case. Signed-off-by: Jonathan Corbet <corbet@lwn.net> LKML-Reference: <20091010153349.113570550@linutronix.de> Acked-by: Mark Gross <mgross@linux.intel.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-10-14 15:31:10 +02:00
Jonathan Corbet	e6fe07a014	pm_qos: remove BKL pm_qos_power_open got its lock_kernel() calls from the open() pushdown. A look at the code shows that the only global resources accessed are pm_qos_array and "name". pm_qos_array doesn't change (things pointed to therein do change, but they are atomics and/or are protected by pm_qos_lock). Accesses to "name" are totally unprotected with or without the BKL; that will be fixed shortly. The BKL is not helpful here; take it out. Signed-off-by: Jonathan Corbet <corbet@lwn.net> LKML-Reference: <20091010153349.071381158@linutronix.de> Acked-by: Mark Gross <mgross@linux.intel.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-10-14 15:31:10 +02:00
Peter Zijlstra	92f6a5e37a	sched: Do less agressive buddy clearing Yanmin reported a hackbench regression due to: > commit `de69a80be3` > Author: Peter Zijlstra <a.p.zijlstra@chello.nl> > Date: Thu Sep 17 09:01:20 2009 +0200 > > sched: Stop buddies from hogging the system I really liked `de69a80b`, and it affecting hackbench shows I wasn't crazy ;-) So hackbench is a multi-cast, with one sender spraying multiple receivers, who in their turn don't spray back. This would be exactly the scenario that patch 'cures'. Previously we would not clear the last buddy after running the next task, allowing the sender to get back to work sooner than it otherwise ought to have been, increasing latencies for other tasks. Now, since those receivers don't poke back, they don't enforce the buddy relation, which means there's nothing to re-elect the sender. Cure this by less agressively clearing the buddy stats. Only clear buddies when they were not chosen. It should still avoid a buddy sticking around long after its served its time. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Mike Galbraith <efault@gmx.de> LKML-Reference: <1255084986.8802.46.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-14 15:02:34 +02:00
Frederic Weisbecker	c44fc77084	tracing: Move syscalls metadata handling from arch to core Most of the syscalls metadata processing is done from arch. But these operations are mostly generic accross archs. Especially now that we have a common variable name that expresses the number of syscalls supported by an arch: NR_syscalls, the only remaining bits that need to reside in arch is the syscall nr to addr translation. v2: Compare syscalls symbols only after the "sys" prefix so that we avoid spurious mismatches with archs that have syscalls wrappers, in which case syscalls symbols have "SyS" prefixed aliases. (Reported by: Heiko Carstens) Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org>	2009-10-14 09:53:56 +02:00
Paul Mackerras	03541f8b69	perf_event: Adjust frequency and unthrottle for non-group-leader events The loop in perf_ctx_adjust_freq checks the frequency of sampling event counters, and adjusts the event interval and unthrottles the event if required, and resets the interrupt count for the event. However, at present it only looks at group leaders. This means that a sampling event that is not a group leader will eventually get throttled, once its interrupt count reaches sysctl_perf_event_sample_rate/HZ --- and that is guaranteed to happen, if the event is active for long enough, since the interrupt count never gets reset. Once it is throttled it never gets unthrottled, so it basically just stops working at that point. This fixes it by making perf_ctx_adjust_freq use ctx->event_list rather than ctx->group_list. The existing spin_lock/spin_unlock around the loop makes it unnecessary to put rcu_read_lock/ rcu_read_unlock around the list_for_each_entry_rcu(). Reported-by: Mark W. Krentel <krentel@cs.rice.edu> Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <19157.26731.855609.165622@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-14 08:39:32 +02:00
Jiri Olsa	5cb084bb1f	tracing: Enable records during the module load I was debuging some module using "function" and "function_graph" tracers and noticed, that if you load module after you enabled tracing, the module's hooks will convert only to NOP instructions. The attached patch enables modules' hooks if there's function trace allready on, thus allowing to trace module functions. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20091013203425.896285120@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-14 08:13:54 +02:00
jolsa@redhat.com	756d17ee7e	tracing: Support multiple pids in set_pid_ftrace file Adding the possibility to set more than 1 pid in the set_pid_ftrace file, thus allowing to trace more than 1 independent processes. Usage: sh-4.0# echo 284 > ./set_ftrace_pid sh-4.0# cat ./set_ftrace_pid 284 sh-4.0# echo 1 >> ./set_ftrace_pid sh-4.0# echo 0 >> ./set_ftrace_pid sh-4.0# cat ./set_ftrace_pid swapper tasks 1 284 sh-4.0# echo 4 > ./set_ftrace_pid sh-4.0# cat ./set_ftrace_pid 4 sh-4.0# echo > ./set_ftrace_pid sh-4.0# cat ./set_ftrace_pid no pid sh-4.0# Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091013203425.565454612@goodmis.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-14 08:13:53 +02:00
Arjan van de Ven	825332e4ff	capabilities: simplify bound checks for copy_from_user() The capabilities syscall has a copy_from_user() call where gcc currently cannot prove to itself that the copy is always within bounds. This patch adds a very explicity bound check to prove to gcc that this copy_from_user cannot overflow its destination buffer. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: James Morris <jmorris@namei.org>	2009-10-14 08:17:36 +11:00
Thomas Gleixner	d58e6576b0	futex: Handle spurious wake up The futex code does not handle spurious wake up in futex_wait and futex_wait_requeue_pi. The code assumes that any wake up which was not caused by futex_wake / requeue or by a timeout was caused by a signal wake up and returns one of the syscall restart error codes. In case of a spurious wake up the signal delivery code which deals with the restart error codes is not invoked and we return that error code to user space. That causes applications which actually check the return codes to fail. Blaise reported that on preempt-rt a python test program run into a exception trap. -rt exposed that due to a built in spurious wake up accelerator :) Solve this by checking signal_pending(current) in the wake up path and handle the spurious wake up case w/o returning to user space. Reported-by: Blaise Gassend <blaise@willowgarage.com> Debugged-by: Darren Hart <dvhltc@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@kernel.org LKML-Reference: <new-submission>	2009-10-13 20:40:43 +02:00
Linus Torvalds	80f506918f	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cciss: Add cciss_allow_hpsa module parameter cciss: Fix multiple calls to pci_release_regions blk-settings: fix function parameter kernel-doc notation writeback: kill space in debugfs item name writeback: account IO throttling wait as iowait elv_iosched_store(): fix strstrip() misuse cfq-iosched: avoid probable slice overrun when idling cfq-iosched: apply bool value where we return 0/1 cfq-iosched: fix think time allowed for seekers cfq-iosched: fix the slice residual sign cfq-iosched: abstract out the 'may this cfqq dispatch' logic block: use proper BLK_RW_ASYNC in blk_queue_start_tag() block: Seperate read and write statistics of in_flight requests v2 block: get rid of kblock_schedule_delayed_work() cfq-iosched: fix possible problem with jiffies wraparound cfq-iosched: fix issue with rq-rq merging and fifo list ordering	2009-10-13 10:21:33 -07:00
Tejun Heo	dec54bf538	this_cpu: Use this_cpu_xx in trace_functions_graph.c ftrace_cpu_disabled usage in trace_functions_graph.c were left out during this_cpu_xx conversion in commit `9288f99a` causing compile failure. Convert them. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Christoph Lameter <cl@linux-foundation.org>	2009-10-13 23:23:02 +09:00
Ingo Molnar	1bac0497ef	Merge branch 'tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into tracing/core	2009-10-13 12:03:08 +02:00
Frederic Weisbecker	bf7c5b43a1	tracing: Remove unused ftrace_trace_addr helper Remove the ftrace_trace_addr() function as only its off-case is implemented and there are no users of it currently. But we keep ftrace_graph_addr() off-case, in case someone come to use the function graph tracer to profit from top-level callers filtering. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2009-10-13 09:33:40 +02:00
Frederic Weisbecker	aef6f81b55	tracing: Rename set_ftrace to set_bootup_ftrace Do this rename because set_ftrace is too much generic and not enough self-explainable as a name. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2009-10-13 09:32:57 +02:00
Ingo Molnar	9dbdd6c41c	Merge commit 'v2.6.32-rc4' into perf/core Merge reason: we were on an -rc1 base, merge up to -rc4. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-13 09:31:34 +02:00
Ingo Molnar	2c96c142e9	Merge branch 'tracing/urgent' into tracing/core Merge reason: Pick up tracing/filters fix from the urgent queue, we will queue up dependent patches. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-13 09:24:59 +02:00
Arnaldo Carvalho de Melo	a2e2725541	net: Introduce recvmmsg socket syscall Meaning receive multiple messages, reducing the number of syscalls and net stack entry/exit operations. Next patches will introduce mechanisms where protocols that want to optimize this operation will provide an unlocked_recvmsg operation. This takes into account comments made by: . Paul Moore: sock_recvmsg is called only for the first datagram, sock_recvmsg_nosec is used for the rest. . Caitlin Bestler: recvmmsg now has a struct timespec timeout, that works in the same fashion as the ppoll one. If the underlying protocol returns a datagram with MSG_OOB set, this will make recvmmsg return right away with as many datagrams (+ the OOB one) it has received so far. . Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen datagrams and then recvmsg returns an error, recvmmsg will return the successfully received datagrams, store the error and return it in the next call. This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg, where we will be able to acquire the lock only at batch start and end, not at every underlying recvmsg call. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-12 23:40:10 -07:00
Li Zefan	8ad807318f	tracing/filters: Fix memory leak when setting a filter Every time we set a filter, we leak memory allocated by postfix_append_operand() and postfix_append_op(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: <stable@kernel.org> # for v2.6.31.x LKML-Reference: <4AD3D7D9.4070400@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-13 08:05:17 +02:00
Masami Hiramatsu	e93f4d8539	tracing/kprobes: Robustify fixed field names against variable field names conflicts Rename probe-common fixed field names to harder conflictable names, because current 'ip', 'func', and other probe field names are easily in conflict with user-specified variable names. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Frank Ch. Eigler <fche@redhat.com> LKML-Reference: <20091007222814.1684.407.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-10-12 23:31:51 +02:00
Masami Hiramatsu	a703d946e8	tracing/kprobes: Avoid field name confliction Check whether the argument name is in conflict with other field names while creating a kprobe through the debugfs interface. Changes in v3: - Check strcmp() == 0 instead of !strcmp(). Changes in v2: - Add common_lock_depth to reserved name list. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Frank Ch. Eigler <fche@redhat.com> LKML-Reference: <20091007222807.1684.26880.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-10-12 23:31:49 +02:00
Masami Hiramatsu	2e06ff6389	tracing/kprobes: Make special variable names more self-explainable Rename special variables to more self-explainable names as below: - $rv to $retval - $sa to $stack - $aN to $argN - $sN to $stackN Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Frank Ch. Eigler <fche@redhat.com> LKML-Reference: <20091007222759.1684.3319.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-10-12 23:30:29 +02:00
Stefan Assmann	369bc18f9a	ftrace: add kernel command line graph function filtering Add a command line parameter to allow limiting the function graphs that are traced on boot up from the given top-level callers , when ftrace=function_graph is specified. This patch adds the following command line option: ftrace_graph_filter=function-list Where function-list is a comma separated list of functions to filter. [fweisbec@gmail.com: picked the documentation changes from the v2 patch] Signed-off-by: Stefan Assmann <sassmann@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4AD2DEB9.2@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-10-12 22:17:21 +02:00
Masami Hiramatsu	99329c44f2	tracing/kprobes: Remove '$ra' special variable Remove '$ra' (return address) because it is already shown at the head of each entry. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Frank Ch. Eigler <fche@redhat.com> LKML-Reference: <20091007222748.1684.12711.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-10-12 19:24:05 +02:00
Masami Hiramatsu	405b2651e4	tracing/kprobes: Add $ prefix to special variables Add $ prefix to the special variables(e.g. sa, rv) of kprobe-tracer. This resolves consistency issues between kprobe_events and perf-kprobe. The main goal is to avoid conflicts between local variable names of probed functions, used by perf probe, and special variables used in the kprobe event creation interface (stack values, etc...) and also available from perf probe. ie: we don't want rv (return value) to conflict with a local variable named rv in a probed function. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Frank Ch. Eigler <fche@redhat.com> LKML-Reference: <20091007222740.1684.91170.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-10-12 19:21:35 +02:00
Christoph Lameter	9288f99aa5	this_cpu: Use this_cpu_xx for ftrace this_cpu_xx can reduce the instruction count here and also avoid address arithmetic. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Tejun Heo <tj@kernel.org>	2009-10-12 19:51:49 +09:00
Randy Dunlap	e17b38bf9e	sched: Fix missing kernel-doc notation The following htmldocs warnings: Warning(kernel/sched.c:685): No description found for parameter 'cpu' Warning(kernel/sched.c:3676): No description found for parameter 'sd' Trigger because new parameters were added to update_rq_clock() and update_group_power() without updating the kernel-doc notation. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4AD29070.7070002@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-12 10:50:06 +02:00
Alexey Dobriyan	d43c36dc6b	headers: remove sched.h from interrupt.h After m68k's task_thread_info() doesn't refer to current, it's possible to remove sched.h from interrupt.h and not break m68k! Many thanks to Heiko Carstens for allowing this. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-10-11 11:20:58 -07:00
Mike Galbraith	f5dc37530b	sched: Update the clock of runqueue select_task_rq() selected In try_to_wake_up(), we update the runqueue clock, but select_task_rq() may select a different runqueue than the one we updated, leaving the new runqueue's clock stale for a bit. This patch cures occasional huge latencies reported by latencytop when coming out of idle on a mostly idle NO_HZ box. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1255070103.7639.30.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-09 15:58:11 +02:00
Peter Zijlstra	3365e77987	lockdep: Use cpu_clock() for lockstat Some tracepoint magic (TRACE_EVENT(lock_acquired)) relies on the fact that lock hold times are positive and uses div64 on that. That triggered a build warning on MIPS, and probably causes bad output in certain circumstances as well. Make it truly positive. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1254818502.21044.112.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-09 15:56:44 +02:00
Wu Fengguang	d25105e891	writeback: account IO throttling wait as iowait It makes sense to do IOWAIT when someone is blocked due to IO throttle, as suggested by Kame and Peter. There is an old comment for not doing IOWAIT on throttle, however it has been mismatching the code for a long time. If we stop accounting IOWAIT for 2.6.32, it could be an undesirable behavior change. So restore the io_schedule. CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-09 12:40:42 +02:00
Steven Rostedt	a813a15976	tracing: fix trace_vprintk call The addition of trace_array_{v}printk used the wrong function for trace_vprintk to call. This broke trace_marker and trace_vprintk itself. Although trace_printk may not have been affected by those that end up calling trace_vbprintk. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-10-09 01:41:35 -04:00
Linus Torvalds	f579bbcd9b	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: fix requeue_pi key imbalance futex: Fix typo in FUTEX_WAIT/WAKE_BITSET_PRIVATE definitions rcu: Place root rcu_node structure in separate lockdep class rcu: Make hot-unplugged CPU relinquish its own RCU callbacks rcu: Move rcu_barrier() to rcutree futex: Move exit_pi_state() call to release_mm() futex: Nullify robust lists after cleanup futex: Fix locking imbalance panic: Fix panic message visibility by calling bust_spinlocks(0) before dying rcu: Replace the rcu_barrier enum with pointer to call_rcu*() function rcu: Clean up code based on review feedback from Josh Triplett, part 4 rcu: Clean up code based on review feedback from Josh Triplett, part 3 rcu: Fix rcu_lock_map build failure on CONFIG_PROVE_LOCKING=y rcu: Clean up code to address Ingo's checkpatch feedback rcu: Clean up code based on review feedback from Josh Triplett, part 2 rcu: Clean up code based on review feedback from Josh Triplett	2009-10-08 12:16:35 -07:00
Linus Torvalds	e80fb7e52f	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Set correct normal_prio and prio values in sched_fork()	2009-10-08 12:07:24 -07:00
Linus Torvalds	f17f36bb1c	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: user local buffer variable for trace branch tracer tracing: fix warning on kernel/trace/trace_branch.c andtrace_hw_branches.c ftrace: check for failure for all conversions tracing: correct module boundaries for ftrace_release tracing: fix transposed numbers of lock_depth and preempt_count trace: Fix missing assignment in trace_ctxwake_* tracing: Use free_percpu instead of kfree tracing: Check total refcount before releasing bufs in profile_enable failure	2009-10-08 12:06:09 -07:00
Linus Torvalds	b924f9599d	Merge branch 'sparc-perf-events-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sparc-perf-events-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA perf_event: Provide vmalloc() based mmap() backing	2009-10-08 12:05:50 -07:00
Linus Torvalds	b9d40b7b1e	Merge branch 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf_events: Make ABI definitions available to userspace perf tools: elf_sym__is_function() should accept "zero" sized functions tracing/syscalls: Use long for syscall ret format and field definitions perf trace: Update eval_flag() flags array to match interrupt.h perf trace: Remove unused code in builtin-trace.c perf: Propagate term signal to child	2009-10-08 12:05:00 -07:00
Steven Rostedt	8f6e8a314a	tracing: user local buffer variable for trace branch tracer Just using the tr->buffer for the API to trace_buffer_lock_reserve is not good enough. This is because the tr->buffer may change, and we do not want to commit with a different buffer that we reserved from. This patch uses a local variable to hold the buffer that was used to reserve and commit with. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-10-07 21:53:41 -04:00
Zhenwen Xu	c8647b2872	tracing: fix warning on kernel/trace/trace_branch.c andtrace_hw_branches.c fix warnings that caused the API change of trace_buffer_lock_reserve() change files: kernel/trace/trace_hw_branch.c kernel/trace/trace_branch.c Signed-off-by: Zhenwen Xu <helight.xu@gmail.com> LKML-Reference: <20091008012146.GA4170@helight> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-10-07 21:52:03 -04:00
Steven Rostedt	3279ba37db	ftrace: check for failure for all conversions Due to legacy code from back when the dynamic tracer used a daemon, only core kernel code was checking for failures. This is no longer the case. We must check for failures any time we perform text modifications. Cc: stable@kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-10-07 17:22:24 -04:00
jolsa@redhat.com	e7247a15ff	tracing: correct module boundaries for ftrace_release When the module is about the unload we release its call records. The ftrace_release function was given wrong values representing the module core boundaries, thus not releasing its call records. Plus making ftrace_release function module specific. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1254934835-363-3-git-send-email-jolsa@redhat.com> Cc: stable@kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-10-07 15:52:09 -04:00
Darren Hart	da08568101	futex: fix requeue_pi key imbalance If futex_wait_requeue_pi() wakes prior to requeue, we drop the reference to the source futex_key twice, once in handle_early_requeue_pi_wakeup() and once on our way out. Remove the drop from the handle_early_requeue_pi_wakeup() and keep the get/drops together in futex_wait_requeue_pi(). Reported-by: Helge Bahmann <hcb@chaoticmind.net> Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Helge Bahmann <hcb@chaoticmind.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: stable-2.6.31 <stable@kernel.org> LKML-Reference: <4ACCE21E.5030805@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-10-07 21:22:03 +02:00
Steven Rostedt	829b876dfc	tracing: fix transposed numbers of lock_depth and preempt_count The lock_depth and preempt_count numbers in the latency format is transposed. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-10-07 14:05:04 -04:00
Eero Nurkkala	fdc6f192e7	NOHZ: update idle state also when NOHZ is inactive Commit `f2e21c9610` had unfortunate side effects with cpufreq governors on some systems. If the system did not switch into NOHZ mode ts->inidle is not set when tick_nohz_stop_sched_tick() is called from the idle routine. Therefor all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick() fail to call tick_nohz_start_idle(). This results in bogus idle accounting information which is passed to cpufreq governors. Set the inidle flag unconditionally of the NOHZ active state to keep the idle time accounting correct in any case. [ tglx: Added comment and tweaked the changelog ] Reported-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com> Cc: Rik van Riel <riel@redhat.com> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Greg KH <greg@kroah.com> Cc: Steven Noonan <steven@uplinklabs.net> Cc: stable@kernel.org LKML-Reference: <1254907901.30157.93.camel@eenurkka-desktop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-10-07 13:05:05 +02:00
Paul E. McKenney	978c0b8814	rcu: Place root rcu_node structure in separate lockdep class Before this patch, all of the rcu_node structures were in the same lockdep class, so that lockdep would complain when rcu_preempt_offline_tasks() acquired the root rcu_node structure's lock while holding one of the leaf rcu_nodes' locks. This patch changes rcu_init_one() to use a separate spin_lock_init() for the root rcu_node structure's lock than is used for that of all of the rest of the rcu_node structures, which puts the root rcu_node structure's lock in its own lockdep class. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12548908983277-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-07 08:11:21 +02:00
Paul E. McKenney	e74f4c4564	rcu: Make hot-unplugged CPU relinquish its own RCU callbacks The current interaction between RCU and CPU hotplug requires that RCU block in CPU notifiers waiting for callbacks to drain. This can be greatly simplified by having each CPU relinquish its own callbacks, and for both _rcu_barrier() and CPU_DEAD notifiers to adopt all callbacks that were previously relinquished. This change also eliminates the possibility of certain types of hangs due to the previous practice of waiting for callbacks to be invoked from within CPU notifiers. If you don't every wait, you cannot hang. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1254890898456-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-07 08:11:20 +02:00
Paul E. McKenney	d0ec774cb2	rcu: Move rcu_barrier() to rcutree Move the existing rcu_barrier() implementation to rcutree.c, consistent with the fact that the rcu_barrier() implementation is tied quite tightly to the RCU implementation. This opens the way to simplify and fix rcutree.c's rcu_barrier() implementation in a later patch. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12548908982563-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-07 08:11:20 +02:00
Thomas Gleixner	322a2c100a	futex: Move exit_pi_state() call to release_mm() exit_pi_state() is called from do_exit() but not from do_execve(). Move it to release_mm() so it gets called from do_execve() as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Cc: stable@kernel.org Cc: Anirban Sinha <ani@anirban.org> Cc: Peter Zijlstra <peterz@infradead.org>	2009-10-06 17:00:01 +02:00
Peter Zijlstra	fc6b177dee	futex: Nullify robust lists after cleanup The robust list pointers of user space held futexes are kept intact over an exec() call. When the exec'ed task exits exit_robust_list() is called with the stale pointer. The risk of corruption is minimal, but still it is incorrect to keep the pointers valid. Actually glibc should uninstall the robust list before calling exec() but we have to deal with it anyway. Nullify the pointers after [compat_]exit_robust_list() has been called. Reported-by: Anirban Sinha <ani@anirban.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Cc: stable@kernel.org	2009-10-06 17:00:01 +02:00
Tom Zanussi	26a50744b2	tracing/events: Add 'signed' field to format files The sign info used for filters in the kernel is also useful to applications that process the trace stream. Add it to the format files and make it available to userspace. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: rostedt@goodmis.org Cc: lizf@cn.fujitsu.com Cc: hch@infradead.org Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <1254809398-8078-2-git-send-email-tzanussi@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-06 15:04:45 +02:00
Hiroshi Shimamoto	b0f56f1a63	trace: Fix missing assignment in trace_ctxwake_* The state char variable S should be reassigned, if S == 0. We are missing the state of the task that is going to sleep for the context switch events (in the raw mode). Fortunately the problem arises with the sched_switch/wake_up tracers, not the sched trace events. The formers are legacy now. But still, that was buggy. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Steven Rostedt <srostedt@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4AC43118.6050409@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-06 14:28:24 +02:00
Peter Zijlstra	906010b213	perf_event: Provide vmalloc() based mmap() backing Some architectures such as Sparc, ARM and MIPS (basically everything with flush_dcache_page()) need to deal with dcache aliases by carefully placing pages in both kernel and user maps. These architectures typically have to use vmalloc_user() for this. However, on other architectures, vmalloc() is not needed and has the downsides of being more restricted and slower than regular allocations. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: David Miller <davem@davemloft.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1254830228.21044.272.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-06 14:21:50 +02:00
Tom Zanussi	ee949a86b3	tracing/syscalls: Use long for syscall ret format and field definitions The syscall event definitions use long for the syscall exit ret value, but unsigned long for the same thing in the format and field definitions. Change them all to long. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: rostedt@goodmis.org Cc: lizf@cn.fujitsu.com Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <1254808849-7829-4-git-send-email-tzanussi@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-06 12:02:34 +02:00
Jayson R. King	cf82ff7ea7	sched: Remove obsolete comment in sched_init() Remove the comment about calling alloc_bootmem() as it is not called here since commit `36b7b6d465`. Signed-off-by: Jayson R. King <dev@jaysonking.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <trivial@kernel.org> LKML-Reference: <4AC9C8A6.6010209@jaysonking.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-05 21:37:22 +02:00
Thomas Gleixner	eaaea8036d	futex: Fix locking imbalance Rich reported a lock imbalance in the futex code: http://bugzilla.kernel.org/show_bug.cgi?id=14288 It's caused by the displacement of the retry_private label in futex_wake_op(). The code unlocks the hash bucket locks in the error handling path and retries without locking them again which makes the next unlock fail. Move retry_private so we lock the hash bucket locks when we retry. Reported-by: Rich Ercolany <rercola@acm.jhu.edu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <dvhltc@us.ibm.com> Cc: stable-2.6.31 <stable@kernel.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-05 21:08:14 +02:00
Aaro Koskinen	d014e8894d	panic: Fix panic message visibility by calling bust_spinlocks(0) before dying Commit `ffd71da4e3` ("panic: decrease oops_in_progress only after having done the panic") moved bust_spinlocks(0) to the end of the function, which in practice is never reached. As a result console_unblank() is not called, and on some systems the user may not see the panic message. Move it back up to before the unblanking. Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1254483680-25578-1-git-send-email-aaro.koskinen@nokia.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-05 21:08:09 +02:00
Linus Torvalds	41cb6654eb	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tools: Run generate-cmdlist.sh properly perf_event: Clean up perf_event_init_task() perf_event: Fix event group handling in __perf_event_sched_*() perf timechart: Add a power-only mode perf top: Add poll_idle to the skip list	2009-10-05 12:04:41 -07:00
Linus Torvalds	e69a9ac596	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer: Remove overly verbose "switch to high res mode" message	2009-10-05 12:04:16 -07:00
Linus Torvalds	0f26ec69f0	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: kmemtrace: Fix up tracer registration tracing: Fix infinite recursion in ftrace_update_pid_func()	2009-10-05 12:03:43 -07:00
Paul E. McKenney	135c8aea55	rcu: Replace the rcu_barrier enum with pointer to call_rcu*() function The rcu_barrier enum causes several problems: (1) you have to define the enum somewhere, and there is no convenient place, (2) the difference between TREE_RCU and TREE_PREEMPT_RCU causes problems when you need to map from rcu_barrier enum to struct rcu_state, (3) the switch statement are large, and (4) TINY_RCU really needs a different rcu_barrier() than do the treercu implementations. So replace it with a functionally equivalent but cleaner function pointer abstraction. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12541998232366-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-05 21:02:05 +02:00
Paul E. McKenney	a0b6c9a78c	rcu: Clean up code based on review feedback from Josh Triplett, part 4 These issues identified during an old-fashioned face-to-face code review extending over many hours. This group improves an existing abstraction and introduces two new ones. It also fixes an RCU stall-warning bug found while making the other changes. o Make RCU_INIT_FLAVOR() declare its own variables, removing the need to declare them at each call site. o Create an rcu_for_each_leaf() macro that scans the leaf nodes of the rcu_node tree. o Create an rcu_for_each_node_breadth_first() macro that does a breadth-first traversal of the rcu_node tree, AKA stepping through the array in index-number order. o If all CPUs corresponding to a given leaf rcu_node structure go offline, then any tasks queued on that leaf will be moved to the root rcu_node structure. Therefore, the stall-warning code must dump out tasks queued on the root rcu_node structure as well as those queued on the leaf rcu_node structures. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12541491934126-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-05 21:02:04 +02:00
Paul E. McKenney	3d76c08290	rcu: Clean up code based on review feedback from Josh Triplett, part 3 Whitespace fixes, updated comments, and trivial code movement. o Fix whitespace error in RCU_HEAD_INIT() o Move "So where is rcu_write_lock()" comment so that it does not come between the rcu_read_unlock() header comment and the rcu_read_unlock() definition. o Move the module_param statements for blimit, qhimark, and qlowmark to immediately follow the corresponding definitions. o In __rcu_offline_cpu(), move the assignment to rdp_me inside the "if" statement, given that rdp_me is not used outside of that "if" statement. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12541491931164-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-05 21:02:02 +02:00
Paul E. McKenney	162cc2794d	rcu: Fix rcu_lock_map build failure on CONFIG_PROVE_LOCKING=y Move the rcu_lock_map definition from rcutree.c to rcupdate.c so that TINY_RCU can use lockdep. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-05 21:01:28 +02:00
john stultz	7bc7d63745	time: Remove xtime_cache With the prior logarithmic time accumulation patch, xtime will now always be within one "tick" of the current time, instead of possibly half a second off. This removes the need for the xtime_cache value, which always stored the time at the last interrupt, so this patch cleans that up removing the xtime_cache related code. This is a bit simpler, but still could use some wider testing. Signed-off-by: John Stultz <johnstul@us.ibm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <1254525855.7741.95.camel@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-05 13:52:02 +02:00
john stultz	a092ff0f90	time: Implement logarithmic time accumulation Accumulating one tick at a time works well unless we're using NOHZ. Then it can be an issue, since we may have to run through the loop a few thousand times, which can increase timer interrupt caused latency. The current solution was to accumulate in half-second intervals with NOHZ. This kept the number of loops down, however it did slightly change how we make NTP adjustments. While not an issue with NTPd users, as NTPd makes adjustments over a longer period of time, other adjtimex() users have noticed the half-second granularity with which we can apply frequency changes to the clock. For instance, if a application tries to apply a 100ppm frequency correction for 20ms to correct a 2us offset, with NOHZ they either get no correction, or a 50us correction. Now, there will always be some granularity error for applying frequency corrections. However with users sensitive to this error have seen a 50-500x increase with NOHZ compared to running without NOHZ. So I figured I'd try another approach then just simply increasing the interval. My approach is to consume the time interval logarithmically. This reduces the number of times through the loop needed keeping latency down, while still preserving the original granularity error for adjtimex() changes. Further, this change allows us to remove the xtime_cache code (patch to follow), as xtime is always within one tick of the current time, instead of the half-second updates it saw before. An earlier version of this patch has been shipping to x86 users in the RedHat MRG releases for awhile without issue, but I've reworked this version to be even more careful about avoiding possible overflows if the shift value gets too large. Signed-off-by: John Stultz <johnstul@us.ibm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <1254525473.7741.88.camel@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-05 13:51:48 +02:00
Peter Williams	f83f9ac263	sched: Set correct normal_prio and prio values in sched_fork() normal_prio should be updated if policy changes from RT to SCHED_MORMAL or if static_prio/nice is changed. Some paths through sched_fork() ignore this requirement and may result in normal_prio having an invalid value. Fixing this issue allows the call to effective_prio() in wake_up_new_task() to be removed. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <f8f46736fd4e7f090ac0.1253774830@mudlark.pw.nest> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-05 13:42:20 +02:00
Frederic Weisbecker	75fb4090b3	tracing: Use free_percpu instead of kfree In the event->profile_enable() failure path, we release the per cpu buffers using kfree which is wrong because they are per cpu pointers. Although free_percpu only wraps kfree for now, that may change in the future so lets use the correct way. Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2009-10-05 10:57:56 +02:00
Frederic Weisbecker	fe8e5b5a60	tracing: Check total refcount before releasing bufs in profile_enable failure When we call the profile_enable() callback of an event, we release the shared perf event tracing buffers unconditionnaly in the failure path. This is wrong because there may be other users of these. Then check the total refcount before doing this. Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2009-10-05 10:57:41 +02:00
Linus Torvalds	58e57fbd1c	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (41 commits) Revert "Seperate read and write statistics of in_flight requests" cfq-iosched: don't delay async queue if it hasn't dispatched at all block: Topology ioctls cfq-iosched: use assigned slice sync value, not default cfq-iosched: rename 'desktop' sysfs entry to 'low_latency' cfq-iosched: implement slower async initiate and queue ramp up cfq-iosched: delay async IO dispatch, if sync IO was just done cfq-iosched: add a knob for desktop interactiveness Add a tracepoint for block request remapping block: allow large discard requests block: use normal I/O path for discard requests swapfile: avoid NULL pointer dereference in swapon when s_bdev is NULL fs/bio.c: move EXPORT* macros to line after function Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs cciss: fix build when !PROC_FS block: Do not clamp max_hw_sectors for stacking devices block: Set max_sectors correctly for stacking devices cciss: cciss_host_attr_groups should be const cciss: Dynamically allocate the drive_info_struct for each logical drive. cciss: Add usage_count attribute to each logical drive in /sys ...	2009-10-04 12:39:14 -07:00
Andi Kleen	1087e9b4ff	HWPOISON: Clean up PR_MCE_KILL interface While writing the manpage I noticed some shortcomings in the current interface. - Define symbolic names for all the different values - Boundary check the kill mode values - For symmetry add a get interface too. This allows library code to get/set the current state. - For consistency define a PR_MCE_KILL_DEFAULT value Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-10-04 03:23:17 +02:00
Christoph Lameter	e800879d50	this_cpu: Use this_cpu operations in RCU RCU does not do dynamic allocations but it increments per cpu variables a lot. These instructions results in a move to a register and then back to memory. This patch will make it use the inc/dec instructions on x86 that do not need a register. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2009-10-03 19:48:23 +09:00
Masami Hiramatsu	88f70d7590	tracing/ftrace: Fix to check create_event_dir() when adding new events Check result of event_create_dir() and add ftrace_event_call to ftrace_events list only if it is succeeded. Thanks to Li for pointing it out. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <20090925182054.10157.55219.stgit@omoto> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-10-03 03:04:58 +02:00
Masami Hiramatsu	a1a138d05f	tracing/kprobes: Use global event perf buffers in kprobe tracer Use new percpu global event buffer instead of stack in kprobe tracer while tracing through perf. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Tom Zanussi <tzanussi@gmail.com> LKML-Reference: <20090925182011.10157.60140.stgit@omoto> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-10-03 02:21:39 +02:00
Tejun Heo	23fb064bb9	percpu: kill legacy percpu allocator With ia64 converted, there's no arch left which still uses legacy percpu allocator. Kill it. Signed-off-by: Tejun Heo <tj@kernel.org> Delightedly-acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org>	2009-10-02 13:29:29 +09:00
KAMEZAWA Hiroyuki	4e649152cb	memcg: some modification to softlimit under hierarchical memory reclaim. This patch clean up/fixes for memcg's uncharge soft limit path. Problems: Now, res_counter_charge()/uncharge() handles softlimit information at charge/uncharge and softlimit-check is done when event counter per memcg goes over limit. Now, event counter per memcg is updated only when memory usage is over soft limit. Here, considering hierarchical memcg management, ancesotors should be taken care of. Now, ancerstors(hierarchy) are handled in charge() but not in uncharge(). This is not good. Prolems: 1. memcg's event counter incremented only when softlimit hits. That's bad. It makes event counter hard to be reused for other purpose. 2. At uncharge, only the lowest level rescounter is handled. This is bug. Because ancesotor's event counter is not incremented, children should take care of them. 3. res_counter_uncharge()'s 3rd argument is NULL in most case. ops under res_counter->lock should be small. No "if" sentense is better. Fixes: * Removed soft_limit_xx poitner and checks in charge and uncharge. Do-check-only-when-necessary scheme works enough well without them. * make event-counter of memcg incremented at every charge/uncharge. (per-cpu area will be accessed soon anyway) * All ancestors are checked at soft-limit-check. This is necessary because ancesotor's event counter may never be modified. Then, they should be checked at the same time. Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-01 16:11:13 -07:00
KAMEZAWA Hiroyuki	3dece8347d	cgroup: catch bad css refcnt at css_put __css_put() doesn't check a bug as refcnt goes to minus. I think it should be caught. This patch adds a check for it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-01 16:11:12 -07:00
Alexey Dobriyan	828c09509b	const: constify remaining file_operations [akpm@linux-foundation.org: fix KVM] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-01 16:11:11 -07:00
Paul Mundt	3ae91c21dd	module: fix up CONFIG_KALLSYMS=n build. Starting from commit `4a4962263f` "reduce symbol table for loaded modules (v2)", the kernel/module.c build is broken with CONFIG_KALLSYMS disabled. CC kernel/module.o kernel/module.c:1995: warning: type defaults to 'int' in declaration of 'Elf_Hdr' kernel/module.c:1995: error: expected ';', ',' or ')' before '' token kernel/module.c: In function 'load_module': kernel/module.c:2203: error: 'strmap' undeclared (first use in this function) kernel/module.c:2203: error: (Each undeclared identifier is reported only once kernel/module.c:2203: error: for each function it appears in.) kernel/module.c:2239: error: 'symoffs' undeclared (first use in this function) kernel/module.c:2239: error: implicit declaration of function 'layout_symtab' kernel/module.c:2240: error: 'stroffs' undeclared (first use in this function) make[1]: [kernel/module.o] Error 1 make: * [kernel/module.o] Error 2 There are three different issues: - layout_symtab() takes a const Elf_Ehdr - layout_symtab() needs to return a value - symoffs/stroffs/strmap are referenced by the load_module() code despite being ifdefed out, which seems unnecessary given the noop behaviour of layout_symtab()/add_kallsyms() in the case of CONFIG_KALLSYMS=n. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Jan Beulich <jbeulich@novell.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-01 16:11:11 -07:00
Jun'ichi Nomura	b0da3f0dad	Add a tracepoint for block request remapping Since 2.6.31 now has request-based device-mapper, it's useful to have a tracepoint for request-remapping as well as bio-remapping. This patch adds a tracepoint for request-remapping, trace_block_rq_remap(). Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-01 21:19:34 +02:00
Zdenek Kabelac	48c0d4d4c0	Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs introduced in commit `1d54ad6da9`. Release kobject also in case the request_fn is NULL. Problem was noticed via kmemleak backtrace when some sysfs entries were note properly destroyed during device removal: unreferenced object 0xffff88001aa76640 (size 80): comm "lvcreate", pid 2120, jiffies 4294885144 hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 f0 65 a7 1a 00 88 ff ff .........e...... 90 66 a7 1a 00 88 ff ff 86 1d 53 81 ff ff ff ff .f........S..... backtrace: [<ffffffff813f9cc6>] kmemleak_alloc+0x26/0x60 [<ffffffff8111d693>] kmem_cache_alloc+0x133/0x1c0 [<ffffffff81195891>] sysfs_new_dirent+0x41/0x120 [<ffffffff81194b0c>] sysfs_add_file_mode+0x3c/0xb0 [<ffffffff81197c81>] internal_create_group+0xc1/0x1a0 [<ffffffff81197d93>] sysfs_create_group+0x13/0x20 [<ffffffff810d8004>] blk_trace_init_sysfs+0x14/0x20 [<ffffffff8123f45c>] blk_register_queue+0x3c/0xf0 [<ffffffff812447e4>] add_disk+0x94/0x160 [<ffffffffa00d8b08>] dm_create+0x598/0x6e0 [dm_mod] [<ffffffffa00de951>] dev_create+0x51/0x350 [dm_mod] [<ffffffffa00de823>] ctl_ioctl+0x1a3/0x240 [dm_mod] [<ffffffffa00de8f2>] dm_compat_ctl_ioctl+0x12/0x20 [dm_mod] [<ffffffff81177bfd>] compat_sys_ioctl+0xcd/0x4f0 [<ffffffff81036ed8>] sysenter_dispatch+0x7/0x2c [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-01 21:15:46 +02:00
Avi Kivity	7c68af6e32	core, x86: Add user return notifiers Add a general per-cpu notifier that is called whenever the kernel is about to return to userspace. The notifier uses a thread_info flag and existing checks, so there is no impact on user return or context switch fast paths. This will be used initially to speed up KVM task switching by lazily updating MSRs. Signed-off-by: Avi Kivity <avi@redhat.com> LKML-Reference: <1253342422-13811-1-git-send-email-avi@redhat.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-10-01 12:12:18 -07:00
Paul Mundt	f9ac5a69ed	kmemtrace: Fix up tracer registration Commit `ddc1637af2` ("kmemtrace: Print binary output only if 'bin' option is set") ended up inverting the error detection logic. register_tracer() returns 0 on success, which this change caused to treat as an error, resulting in: [ 0.132000] Warning: could not register the kmem tracer as well as bailing out of the initcall with an error value. This restores the old logic. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <20090928075540.GD6668@linux-sh.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-01 11:53:44 +02:00
Ingo Molnar	0aa73ba1c4	Merge branch 'tracing/urgent' into tracing/core Merge reason: Pick up latest fixes and update to latest upstream. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-01 11:20:48 +02:00
Xiao Guangrong	27f9994c50	perf_event: Clean up perf_event_init_task() While at it: we can traverse ctx->group_list to get all group leader, it should be safe since we hold ctx->mutex. Changlog v1->v2: - remove WARN_ON_ONCE() according to Peter Zijlstra's suggestion Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4ABC5AF9.6060808@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-01 09:30:44 +02:00
Xiao Guangrong	8c9ed8e14c	perf_event: Fix event group handling in __perf_event_sched_*() Paul Mackerras says: "Actually, looking at this more closely, it has to be a group leader anyway since it's at the top level of ctx->group_list. In fact I see four places where we do: list_for_each_entry(event, &ctx->group_list, group_entry) { if (event == event->group_leader) ... or the equivalent, three of which appear to have been introduced by `afedadf2` ("perf_counter: Optimize sched in/out of counters") back in May by Peter Z. As far as I can see the if () is superfluous in each case (a singleton event will be a group of 1 and will have its group_leader pointing to itself)." [ See: http://marc.info/?l=linux-kernel&m=125361238901442&w=2 ] And Peter Zijlstra points out this is a bugfix: "The intent was to call event_sched_{in,out}() for single event groups because that's cheaper than group_sched_{in,out}(), however.. - as you noticed, I got the condition wrong, it should have read: list_empty(&event->sibling_list) - it failed to call group_can_go_on() which deals with ->exclusive. - it also doesn't call hw_perf_group_sched_in() which might break power." [ See: http://marc.info/?l=linux-kernel&m=125369523318583&w=2 ] Changelog v1->v2: - Fix the title name according to Peter Zijlstra's suggestion - Remove the comments and WARN_ON_ONCE() as Peter Zijlstra's suggestion Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4ABC5A55.7000208@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-01 09:30:44 +02:00
Matt Fleming	33974093c0	tracing: Fix infinite recursion in ftrace_update_pid_func() When CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST is enabled __ftrace_trace_function contains the current trace function, not ftrace_trace_function. In ftrace_update_pid_func() we currently incorrectly assign the value of ftrace_trace_function to __ftrace_trace_funcion before returning. Without this patch it is possible to execute an infinite recursion whereby ftrace_test_stop_func() calls __ftrace_trace_function, which was assigned ftrace_test_stop_func() in ftrace_update_pid_func(). Signed-off-by: Matt Fleming <matthew.fleming@imgtec.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1254152581-18347-1-git-send-email-matt@console-pimps.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-10-01 08:19:24 +02:00
Eric Dumazet	152f9d0710	sched_clock: Fix atomicity/continuity bug by using cmpxchg64() Commit `def0a9b257` (sched_clock: Make it NMI safe) assumed cmpxchg() of 64bit values was available on X86_32. That is not so - and causes some subtle scheduler misbehavior due to incorrect timestamps off to up by ~4 seconds. Two symptoms are known right now: - interactivity problems seen by Arjan: up to 600 msecs latencies instead of the expected 20-40 msecs. These latencies are very visible on the desktop. - incorrect CPU stats: occasionally too high percentages in 'top', and crazy CPU usage stats. Reported-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: John Stultz <johnstul@us.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20090930170754.0886ff2e@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-30 22:56:10 +02:00
Alexey Dobriyan	f0f37e2f77	const: mark struct vm_struct_operations * mark struct vm_area_struct::vm_ops as const * mark vm_ops in AGP code But leave TTM code alone, something is fishy there with global vm_ops being used. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-27 11:39:25 -07:00
Linus Torvalds	6f5071020d	Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer: Eliminate needless reprogramming of clock events device	2009-09-27 10:39:04 -07:00
Linus Torvalds	3b383767c4	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Add memory barrier commentary to futex_wait_queue_me() futex: Fix wakeup race by setting TASK_INTERRUPTIBLE before queue_me() futex: Correct futex_q woken state commentary futex: Make function kernel-doc commentary consistent futex: Correct queue_me and unqueue_me commentary futex: Correct futex_wait_requeue_pi() commentary	2009-09-26 10:15:53 -07:00
Linus Torvalds	179b9145d5	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: Resume clocksource without taking the clocksource mutex	2009-09-26 10:14:41 -07:00
Linus Torvalds	4187e7e9f1	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: modules, tracing: Remove stale struct marker signature from module_layout() tracing/workqueue: Use %pf in workqueue trace events tracing: Fix a comment and a trivial format issue in tracepoint.h tracing: Fix failure path in ftrace_regex_open() tracing: Fix failure path in ftrace_graph_write() tracing: Check the return value of trace_get_user() tracing: Fix off-by-one in trace_get_user()	2009-09-26 10:13:54 -07:00
Roland Dreier	d3f6302e7e	hrtimer: Remove overly verbose "switch to high res mode" message On big systems, printing <number of CPUs> copies of Switched to high resolution mode on CPU nnn clutters up the kernel log for minimal gain. Just get rid of them. Signed-off-by: Roland Dreier <rolandd@cisco.com> LKML-Reference: <ada1vlw126s.fsf_-_@cisco.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-26 16:58:09 +02:00
David S. Miller	8b3f6af863	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ Conflicts: drivers/staging/Kconfig drivers/staging/Makefile drivers/staging/cpc-usb/TODO drivers/staging/cpc-usb/cpc-usb_drv.c drivers/staging/cpc-usb/cpc.h drivers/staging/cpc-usb/cpc_int.h drivers/staging/cpc-usb/cpcusb.h	2009-09-24 15:13:11 -07:00
Martin Schwidefsky	89133f9350	clocksource: Resume clocksource without taking the clocksource mutex git commit `75c5158f70` converted the clocksource spinlock to a mutex. This causes the following BUG: BUG: sleeping function called from invalid context at kernel/mutex.c:280 in_atomic(): 0, irqs_disabled(): 1, pid: 2473, name: pm-suspend 2 locks held by pm-suspend/2473: #0: (&buffer->mutex){......}, at: [<ffffffff8115ab13>] sysfs_write_file+0x3c/0x137 #1: (pm_mutex){......}, at: [<ffffffff810865b5>] enter_state+0x39/0x130 Pid: 2473, comm: pm-suspend Not tainted 2.6.31 #1 Call Trace: [<ffffffff810792f0>] ? __debug_show_held_locks+0x22/0x24 [<ffffffff8104a2ef>] __might_sleep+0x107/0x10b [<ffffffff8141fca9>] mutex_lock_nested+0x25/0x43 [<ffffffff81073537>] clocksource_resume+0x1c/0x60 [<ffffffff81072902>] timekeeping_resume+0x1e/0x1c8 [<ffffffff812aee62>] __sysdev_resume+0x25/0xcf [<ffffffff812aef79>] sysdev_resume+0x6d/0xae [<ffffffff810864f8>] suspend_devices_and_enter+0x12b/0x1af [<ffffffff8108665b>] enter_state+0xdf/0x130 [<ffffffff81085dc3>] state_store+0xb6/0xd3 [<ffffffff81204c73>] kobj_attr_store+0x17/0x19 [<ffffffff8115abd2>] sysfs_write_file+0xfb/0x137 [<ffffffff811057d2>] vfs_write+0xae/0x10b [<ffffffff81208392>] ? __up_read+0x1a/0x7f [<ffffffff811058ef>] sys_write+0x4a/0x6e [<ffffffff81011b82>] system_call_fastpath+0x16/0x1b clocksource_resume is called early in the resume process, there is only one cpu, no processes are running and the interrupts are disabled. It is therefore possible to resume the clocksources without taking the clocksource mutex. Reported-by: Xiaotian Feng <xtfeng@gmail.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Tested-by: Michal Schmidt <mschmidt@redhat.com> Cc: Xiaotian Feng <xtfeng@gmail.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090924172952.49697825@mschwide.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-24 22:37:53 +02:00
Darren Hart	9beba3c54d	futex: Add memory barrier commentary to futex_wait_queue_me() The memory barrier semantics of futex_wait_queue_me() are non-obvious. Add some commentary to try and clarify it. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090924185447.694.38948.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-24 22:30:10 +02:00
Frederic Weisbecker	3f6fe06dbf	tracing/filters: Unify the regex parsing helpers The filter code has stolen the regex parsing function from ftrace to get the regex support. We have duplicated this code, so factorize it in the filter area and make it generally available, as the filter code is the most suited to host this feature. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com>	2009-09-24 21:40:13 +02:00
Frederic Weisbecker	1889d20922	tracing/filters: Provide basic regex support This patch provides basic support for regular expressions in filters. It supports the following types of regexp: - match_beginning - match_middle* - match_end* - !don't match Example: cd /debug/tracing/events/bkl/lock_kernel echo 'file == "reiserfs"' > filter echo 1 > enable gedit-4941 [000] 457.735437: lock_kernel: depth: 0, fs/reiserfs/namei.c:334 reiserfs_lookup() sync_supers-227 [001] 461.379985: lock_kernel: depth: 0, fs/reiserfs/super.c:69 reiserfs_sync_fs() sync_supers-227 [000] 461.383096: lock_kernel: depth: 0, fs/reiserfs/journal.c:1069 flush_commit_list() reiserfs/1-1369 [001] 461.479885: lock_kernel: depth: 0, fs/reiserfs/journal.c:3509 flush_async_commits() Every string is now handled as a regexp in the filter framework, which helps to factorize the code for handling both simple strings and regexp comparisons. (The regexp parsing code has been wildly cherry picked from ftrace.c written by Steve.) v2: Simplify the whole and drop the filter_regex file Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com>	2009-09-24 21:39:27 +02:00
Linus Torvalds	a6b49cb210	Merge branch 'for-linus' of git://git.monstr.eu/linux-2.6-microblaze * 'for-linus' of git://git.monstr.eu/linux-2.6-microblaze: (24 commits) microblaze: Disable heartbeat/enable emaclite in defconfigs microblaze: Support simpleImage.dts make target microblaze: Fix _start symbol to physical address microblaze: Use LOAD_OFFSET macro to get correct LMA for all sections microblaze: Create the LOAD_OFFSET macro used to compute VMA vs LMA offsets microblaze: Copy ppc asm-compat.h for clean handling of constants in asm and C microblaze: Actually show KiB rather than pages in "Freeing initrd memory:" microblaze: Support ptrace syscall tracing. microblaze: Updated CPU version and FPGA family codes in PVR microblaze: Generate correct signal and siginfo for integer div-by-zero microblaze: Don't be noisy when userspace causes hardware exceptions microblaze: Remove ipc.h file which points to non-existing asm-generic file microblaze: Clear sticky FSR register after generating exception signals microblaze: Ensure CPU usermode is set on new userspace processes microblaze: Use correct kbuild variable KBUILD_CFLAGS microblaze: Save and restore msr in hw exception microblaze: Add architectural support for USB EHCI host controllers microblaze: Implement include/asm/syscall.h. microblaze: Improve checking mechanism for MSR instruction microblaze: Add checking mechanism for MSR instruction ...	2009-09-24 09:01:44 -07:00
Linus Torvalds	2c9871de0a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: module: don't call percpu_modfree on NULL pointer. module: fix memory leak when load fails after srcversion/version allocated module: preferred way to use MODULE_AUTHOR param: allow whitespace as kernel parameter separator module: reduce string table for loaded modules (v2) module: reduce symbol table for loaded modules (v2)	2009-09-24 09:01:05 -07:00
Linus Torvalds	6d39b27f0a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: lsm: Use a compressed IPv6 string format in audit events Audit: send signal info if selinux is disabled Audit: rearrange audit_context to save 16 bytes per struct Audit: reorganize struct audit_watch to save 8 bytes	2009-09-24 08:31:04 -07:00
Rusty Russell	ffa9f12a41	module: don't call percpu_modfree on NULL pointer. The general one handles NULL, the static obsolescent (CONFIG_HAVE_LEGACY_PER_CPU_AREA) one in module.c doesn't; Eric's commit `720eba31` assumed it did, and various frobbings since then kept that assumption. All other callers in module.c all protect it with an if; this effectively does the same as free_init is only goto if we fail percpu_modalloc(). Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Américo Wang <xiyou.wangcong@gmail.com> Tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>	2009-09-25 00:32:59 +09:30
Rusty Russell	a263f7763c	module: fix memory leak when load fails after srcversion/version allocated Normally the twisty paths of sysfs will free the attributes, but not if we fail before we hook it into sysfs (which is the last thing we do in load_module). (This sysfs code is a turd, no doubt there are other issues lurking too). Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>	2009-09-25 00:32:59 +09:30
Peter Oberparleiter	26d052bfce	param: allow whitespace as kernel parameter separator Some boot mechanisms require that kernel parameters are stored in a separate file which is loaded to memory without further processing (e.g. the "Load from FTP" method on s390). When such a file contains newline characters, the kernel parameter preceding the newline might not be correctly parsed (due to the newline being stuck to the end of the actual parameter value) which can lead to boot failures. This patch improves kernel command line usability in such a situation by allowing generic whitespace characters as separators between kernel parameters. Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-09-25 00:32:58 +09:30
Jan Beulich	554bdfe5ac	module: reduce string table for loaded modules (v2) Also remove all parts of the string table (referenced by the symbol table) that are not needed for kallsyms use (i.e. which were only referenced by symbols discarded by the previous patch, or not referenced at all for whatever reason). Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-09-25 00:32:57 +09:30
Jan Beulich	4a4962263f	module: reduce symbol table for loaded modules (v2) Discard all symbols not interesting for kallsyms use: absolute, section, and in the common case (!KALLSYMS_ALL) data ones. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-09-25 00:32:57 +09:30
Linus Torvalds	db16826367	Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits) HWPOISON: Enable error_remove_page on btrfs HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs HWPOISON: Add madvise() based injector for hardware poisoned pages v4 HWPOISON: Enable error_remove_page for NFS HWPOISON: Enable .remove_error_page for migration aware file systems HWPOISON: The high level memory error handler in the VM v7 HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process HWPOISON: shmem: call set_page_dirty() with locked page HWPOISON: Define a new error_remove_page address space op for async truncation HWPOISON: Add invalidate_inode_page HWPOISON: Refactor truncate to allow direct truncating of page v2 HWPOISON: check and isolate corrupted free pages v2 HWPOISON: Handle hardware poisoned pages in try_to_unmap HWPOISON: Use bitmask/action code for try_to_unmap behaviour HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2 HWPOISON: Add poison check to page fault handling HWPOISON: Add basic support for poisoned pages in fault handler v3 HWPOISON: Add new SIGBUS error codes for hardware poison signals HWPOISON: Add support for poison swap entries v2 HWPOISON: Export some rmap vma locking to outside world ...	2009-09-24 07:53:22 -07:00
Hiroshi Shimamoto	801460d0cf	task_struct cleanup: move binfmt field to mm_struct Because the binfmt is not different between threads in the same process, it can be moved from task_struct to mm_struct. And binfmt moudle is handled per mm_struct instead of task_struct. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:05 -07:00
Alexey Dobriyan	858f09930b	aio: ifdef fields in mm_struct ->ioctx_lock and ->ioctx_list are used only under CONFIG_AIO. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:05 -07:00
Sukadev Bhattiprolu	e5a4738699	pidns: deny CLONE_PARENT\|CLONE_NEWPID combination CLONE_PARENT was used to implement an older threading model. For consistency with the CLONE_THREAD check in copy_pid_ns(), disable CLONE_PARENT with CLONE_NEWPID, at least until the required semantics of pid namespaces are clear. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Oren Laadan <orenl@cs.columbia.edu> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:04 -07:00
Sukadev Bhattiprolu	123be07b0b	fork(): disable CLONE_PARENT for init When global or container-init processes use CLONE_PARENT, they create a multi-rooted process tree. Besides siblings of global init remain as zombies on exit since they are not reaped by their parent (swapper). So prevent global and container-inits from creating siblings. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Oren Laadan <orenl@cs.columbia.edu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:04 -07:00
Alexey Dobriyan	8d65af789f	sysctl: remove "struct file *" argument of ->proc_handler It's unused. It isn't needed -- read or write flag is already passed and sysctl shouldn't care about the rest. It _was_ used in two places at arch/frv for some reason. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:04 -07:00
Roland McGrath	d9588725e5	signals: inline __fatal_signal_pending __fatal_signal_pending inlines to one instruction on x86, probably two instructions on other machines. It takes two longer x86 instructions just to call it and test its return value, not to mention the function itself. On my random x86_64 config, this saved 70 bytes of text (59 of those being __fatal_signal_pending itself). Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:01 -07:00
Oleg Nesterov	4a30debfb7	signals: introduce do_send_sig_info() helper Introduce do_send_sig_info() and convert group_send_sig_info(), send_sig_info(), do_send_specific() to use this helper. Hopefully it will have more users soon, it allows to specify specific/group behaviour via "bool group" argument. Shaves 80 bytes from .text. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stephane eranian <eranian@googlemail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:01 -07:00
Neil Horman	a293980c2e	exec: let do_coredump() limit the number of concurrent dumps to pipes Introduce core pipe limiting sysctl. Since we can dump cores to pipe, rather than directly to the filesystem, we create a condition in which a user can create a very high load on the system simply by running bad applications. If the pipe reader specified in core_pattern is poorly written, we can have lots of ourstandig resources and processes in the system. This sysctl introduces an ability to limit that resource consumption. core_pipe_limit defines how many in-flight dumps may be run in parallel, dumps beyond this value are skipped and a note is made in the kernel log. A special value of 0 in core_pipe_limit denotes unlimited core dumps may be handled (this is the default value). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Earl Chew <earl_chew@agilent.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
Roland McGrath	ae6d2ed7bb	signals: tracehook_notify_jctl change This changes tracehook_notify_jctl() so it's called with the siglock held, and changes its argument and return value definition. These clean-ups make it a better fit for what new tracing hooks need to check. Tracing needs the siglock here, held from the time TASK_STOPPED was set, to avoid potential SIGCONT races if it wants to allow any blocking in its tracing hooks. This also folds the finish_stop() function into its caller do_signal_stop(). The function is short, called only once and only unconditionally. It aids readability to fold it in. [oleg@redhat.com: do not call tracehook_notify_jctl() in TASK_STOPPED state] [oleg@redhat.com: introduce tracehook_finish_jctl() helper] Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
Vitaly Mayatskikh	b6fe2d117e	wait_noreap_copyout(): check for ->wo_info != NULL Current behaviour of sys_waitid() looks odd. If user passes infop == NULL, sys_waitid() returns success. When user additionally specifies flag WNOWAIT, sys_waitid() returns -EFAULT on the same conditions. When user combines WNOWAIT with WCONTINUED, sys_waitid() again returns success. This patch adds check for ->wo_info in wait_noreap_copyout(). User-visible change: starting from this commit, sys_waitid() always checks infop != NULL and does not fail if it is NULL. Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
Vitaly Mayatskikh	dfe16dfa4a	do_wait: fix sys_waitid()-specific behaviour do_wait() checks ->wo_info to figure out who is the caller. If it's not NULL the caller should be sys_waitid(), in that case do_wait() fixes up the retval or zeros ->wo_info, depending on retval from underlying function. This is bug: user can pass ->wo_info == NULL and sys_waitid() will return incorrect value. man 2 waitid says: waitid(): returns 0 on success Test-case: int main(void) { if (fork()) assert(waitid(P_ALL, 0, NULL, WEXITED) == 0); return 0; } Result: Assertion `waitid(P_ALL, 0, ((void *)0), 4) == 0' failed. Move that code to sys_waitid(). User-visible change: sys_waitid() will return 0 on success, either infop is set or not. Note, there's another bug in wait_noreap_copyout() which affects return value of sys_waitid(). It will be fixed in next patch. Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
Oleg Nesterov	b6e763f07f	wait_consider_task: kill "parent" argument Kill the unused "parent" argument in wait_consider_task(), it was never used. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
Oleg Nesterov	989264f464	do_wait-wakeup-optimization: simplify task_pid_type() task_pid_type() is only used by eligible_pid() which has to check wo_type != PIDTYPE_MAX anyway. Remove this check from task_pid_type() and factor out ->pids[type] access, this shrinks .text a bit and simplifies the code. The matches the behaviour of other similar helpers, say get_task_pid(). The caller must ensure that pid_type is valid, not the callee. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
Oleg Nesterov	5c01ba49e6	do_wait-wakeup-optimization: fix child_wait_callback()->eligible_child() usage child_wait_callback()->eligible_child() is not right, we can miss the wakeup if the task was detached before __wake_up_parent() and the caller of do_wait() didn't use __WALL. Move ->wo_pid checks from eligible_child() to the new helper, eligible_pid(), and change child_wait_callback() to use it instead of eligible_child(). Note: actually I think it would be better to fix the __WCLONE check in eligible_child(), it doesn't look exactly right. But it is not clear what is the supposed behaviour, and any change is user-visible. Reported-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
Oleg Nesterov	b4fe51823d	do_wait() wakeup optimization: child_wait_callback: check __WNOTHREAD case Suggested by Roland. do_wait(__WNOTHREAD) can only succeed if the caller is either ptracer, or it is ->real_parent and the child is not traced. IOW, caller == p->parent otherwise we should not wake up. Change child_wait_callback() to check this. Ratan reports the workload with CPU load >99% caused by unnecessary wakeups, should be fixed by this patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
Oleg Nesterov	0b7570e77f	do_wait() wakeup optimization: change __wake_up_parent() to use filtered wakeup Ratan Nalumasu reported that in a process with many threads doing unnecessary wakeups. Every waiting thread in the process wakes up to loop through the children and see that the only ones it cares about are still not ready. Now that we have struct wait_opts we can change do_wait/__wake_up_parent to use filtered wakeups. We can make child_wait_callback() more clever later, right now it only checks eligible_child(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Acked-by: James Morris <jmorris@namei.org> Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:59 -07:00
Oleg Nesterov	a2322e1d27	do_wait() wakeup optimization: shift security_task_wait() from eligible_child() to wait_consider_task() Preparation, no functional changes. eligible_child() has a single caller, wait_consider_task(). We can move security_task_wait() out from eligible_child(), this allows us to use it for filtered wake_up(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:59 -07:00
Oleg Nesterov	a7f0765edf	ptrace: __ptrace_detach: do __wake_up_parent() if we reap the tracee The bug is old, it wasn't cause by recent changes. Test case: static void tfunc(void arg) { int pid = (long)arg; assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0); kill(pid, SIGKILL); sleep(1); return NULL; } int main(void) { pthread_t th; long pid = fork(); if (!pid) pause(); signal(SIGCHLD, SIG_IGN); assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0); int r = waitpid(-1, NULL, __WNOTHREAD); printf("waitpid: %d %m\n", r); return 0; } Before the patch this program hangs, after this patch waitpid() correctly fails with errno == -ECHILD. The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its ->real_parent is our sub-thread and we ignore SIGCHLD. But in this case we should wake up other threads which can sleep in do_wait(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:59 -07:00
Balbir Singh	f64c3f5494	memory controller: soft limit organize cgroups Organize cgroups over soft limit in a RB-Tree Introduce an RB-Tree for storing memory cgroups that are over their soft limit. The overall goal is to 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded. We are careful about updates, updates take place only after a particular time interval has passed 2. We remove the node from the RB-Tree when the usage goes below the soft limit The next set of patches will exploit the RB-Tree to get the group that is over its soft limit by the largest amount and reclaim from it, when we face memory contention. [hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:59 -07:00
Balbir Singh	296c81d89f	memory controller: soft limit interface Add an interface to allow get/set of soft limits. Soft limits for memory plus swap controller (memsw) is currently not supported. Resource counters have been enhanced to support soft limits and new type RES_SOFT_LIMIT has been added. Unlike hard limits, soft limits can be directly set and do not need any reclaim or checks before setting them to a newer value. Kamezawa-San raised a question as to whether soft limit should belong to res_counter. Since all resources understand the basic concepts of hard and soft limits, it is justified to add soft limits here. Soft limits are a generic resource usage feature, even file system quotas support soft limits. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:59 -07:00
Ben Blum	be367d0992	cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time Alter the ss->can_attach and ss->attach functions to be able to deal with a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a pre-patch to cgroup-procs-writable.patch.) Currently, new mode of the attach function can only tell the subsystem about the old cgroup of the threadgroup leader. No subsystem currently needs that information for each thread that's being moved, but if one were to be added (for example, one that counts tasks within a group) this bit would need to be reworked a bit to tell the subsystem the right information. [hidave.darkstar@gmail.com: fix build] Signed-off-by: Ben Blum <bblum@google.com> Signed-off-by: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Matt Helsley <matthltc@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:58 -07:00
Ben Blum	c378369d8b	cgroups: change css_set freeing mechanism to be under RCU Changes css_set freeing mechanism to be under RCU This is a prepatch for making the procs file writable. In order to free the old css_sets for each task to be moved as they're being moved, the freeing mechanism must be RCU-protected, or else we would have to have a call to synchronize_rcu() for each task before freeing its old css_set. Signed-off-by: Ben Blum <bblum@google.com> Signed-off-by: Paul Menage <menage@google.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:58 -07:00
Ben Blum	d1d9fd3308	cgroups: use vmalloc for large cgroups pidlist allocations Separates all pidlist allocation requests to a separate function that judges based on the requested size whether or not the array needs to be vmalloced or can be gotten via kmalloc, and similar for kfree/vfree. Signed-off-by: Ben Blum <bblum@google.com> Signed-off-by: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:58 -07:00
Ben Blum	72a8cb30d1	cgroups: ensure correct concurrent opening/reading of pidlists across pid namespaces Previously there was the problem in which two processes from different pid namespaces reading the tasks or procs file could result in one process seeing results from the other's namespace. Rather than one pidlist for each file in a cgroup, we now keep a list of pidlists keyed by namespace and file type (tasks versus procs) in which entries are placed on demand. Each pidlist has its own lock, and that the pidlists themselves are passed around in the seq_file's private pointer means we don't have to touch the cgroup or its master list except when creating and destroying entries. Signed-off-by: Ben Blum <bblum@google.com> Signed-off-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:58 -07:00
Ben Blum	102a775e36	cgroups: add a read-only "procs" file similar to "tasks" that shows only unique tgids struct cgroup used to have a bunch of fields for keeping track of the pidlist for the tasks file. Those are now separated into a new struct cgroup_pidlist, of which two are had, one for procs and one for tasks. The way the seq_file operations are set up is changed so that just the pidlist struct gets passed around as the private data. Interface example: Suppose a multithreaded process has pid 1000 and other threads with ids 1001, 1002, 1003: $ cat tasks 1000 1001 1002 1003 $ cat cgroup.procs 1000 $ Signed-off-by: Ben Blum <bblum@google.com> Signed-off-by: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:58 -07:00
Paul Menage	8f3ff20862	cgroups: revert "cgroups: fix pid namespace bug" The following series adds a "cgroup.procs" file to each cgroup that reports unique tgids rather than pids, and allows all threads in a threadgroup to be atomically moved to a new cgroup. The subsystem "attach" interface is modified to support attaching whole threadgroups at a time, which could introduce potential problems if any subsystem were to need to access the old cgroup of every thread being moved. The attach interface may need to be revised if this becomes the case. Also added is functionality for read/write locking all CLONE_THREAD fork()ing within a threadgroup, by means of an rwsem that lives in the sighand_struct, for per-threadgroup-ness and also for sharing a cacheline with the sighand's atomic count. This scheme should introduce no extra overhead in the fork path when there's no contention. The final patch reveals potential for a race when forking before a subsystem's attach function is called - one potential solution in case any subsystem has this problem is to hang on to the group's fork mutex through the attach() calls, though no subsystem yet demonstrates need for an extended critical section. This patch: Revert commit `096b7fe012` Author: Li Zefan <lizf@cn.fujitsu.com> AuthorDate: Wed Jul 29 15:04:04 2009 -0700 Commit: Linus Torvalds <torvalds@linux-foundation.org> CommitDate: Wed Jul 29 19:10:35 2009 -0700 cgroups: fix pid namespace bug This is in preparation for some clashing cgroups changes that subsume the original commit's functionaliy. The original commit fixed a pid namespace bug which Ben Blum fixed independently (in the same way, but with different code) as part of a series of patches. I played around with trying to reconcile Ben's patch series with Li's patch, but concluded that it was simpler to just revert Li's, given that Ben's patch series contained essentially the same fix. Signed-off-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:58 -07:00
Paul Menage	2c6ab6d200	cgroups: allow cgroup hierarchies to be created with no bound subsystems This patch removes the restriction that a cgroup hierarchy must have at least one bound subsystem. The mount option "none" is treated as an explicit request for no bound subsystems. A hierarchy with no subsystems can be useful for plain task tracking, and is also a step towards the support for multiply-bindable subsystems. As part of this change, the hierarchy id is no longer calculated from the bitmask of subsystems in the hierarchy (since this is not guaranteed to be unique) but is allocated via an ida. Reference counts on cgroups from css_set objects are now taken explicitly one per hierarchy, rather than one per subsystem. Example usage: mount -t cgroup -o none,name=foo cgroup /mnt/cgroup Based on the "no-op"/"none" subsystem concept proposed by kamezawa.hiroyu@jp.fujitsu.com Signed-off-by: Paul Menage <menage@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:58 -07:00
Paul Menage	7717f7ba92	cgroups: add a back-pointer from struct cg_cgroup_link to struct cgroup Currently the cgroups code makes the assumption that the subsystem pointers in a struct css_set uniquely identify the hierarchy->cgroup mappings associated with the css_set; and there's no way to directly identify the associated set of cgroups other than by indirecting through the appropriate subsystem state pointers. This patch removes the need for that assumption by adding a back-pointer from struct cg_cgroup_link object to its associated cgroup; this allows the set of cgroups to be determined by traversing the cg_links list in the struct css_set. Signed-off-by: Paul Menage <menage@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:58 -07:00
Paul Menage	fe6934354f	cgroups: move the cgroup debug subsys into cgroup.c to access internal state While it's architecturally clean to have the cgroup debug subsystem be completely independent of the cgroups framework, it limits its usefulness for debugging the contents of internal data structures. Move the debug subsystem code into the scope of all the cgroups data structures to make more detailed debugging possible. Signed-off-by: Paul Menage <menage@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:57 -07:00
Paul Menage	c6d57f3312	cgroups: support named cgroups hierarchies To simplify referring to cgroup hierarchies in mount statements, and to allow disambiguation in the presence of empty hierarchies and multiply-bindable subsystems this patch adds support for naming a new cgroup hierarchy via the "name=" mount option A pre-existing hierarchy may be specified by either name or by subsystems; a hierarchy's name cannot be changed by a remount operation. Example usage: # To create a hierarchy called "foo" containing the "cpu" subsystem mount -t cgroup -oname=foo,cpu cgroup /mnt/cgroup1 # To mount the "foo" hierarchy on a second location mount -t cgroup -oname=foo cgroup /mnt/cgroup2 Signed-off-by: Paul Menage <menage@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:57 -07:00
Xiaotian Feng	34f77a90f7	cgroups: make unlock sequence in cgroup_get_sb consistent Make the last unlock sequence consistent with previous unlock sequeue. Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:57 -07:00
Zhaolei	57f1f0874f	time: add function to convert between calendar time and broken-down time for universal use There are many similar code in kernel for one object: convert time between calendar time and broken-down time. Here is some source I found: fs/ncpfs/dir.c fs/smbfs/proc.c fs/fat/misc.c fs/udf/udftime.c fs/cifs/netmisc.c net/netfilter/xt_time.c drivers/scsi/ips.c drivers/input/misc/hp_sdc_rtc.c drivers/rtc/rtc-lib.c arch/ia64/hp/sim/boot/fw-emu.c arch/m68k/mac/misc.c arch/powerpc/kernel/time.c arch/parisc/include/asm/rtc.h ... We can make a common function for this type of conversion, At least we can get following benefit: 1: Make kernel simple and unify 2: Easy to fix bug in converting code 3: Reduce clone of code in future For example, I'm trying to make ftrace display walltime, this patch will make me easy. This code is based on code from glibc-2.6 Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pavel Machek <pavel@ucw.cz> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:56 -07:00
Frederic Weisbecker	f3f3f00924	tracing/event: Cleanup the useless dentry variable Cleanup the useless dentry variable while creating a kernel event set of files. trace_create_file() warns if it fails to create the file anyway, and we don't store the dentry anywhere. v2: Fix a small conflict in kernel/trace/trace_events.c Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2009-09-24 15:27:41 +02:00
Frederic Weisbecker	737f453fd1	tracing/filters: Cleanup useless headers Cleanup remaining headers inclusion that were only useful when the filter framework and its tracing related filesystem user interface weren't yet separated. v2: Keep module.h, needed for EXPORT_SYMBOL_GPL Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2009-09-24 15:21:58 +02:00
Eric Paris	939cbf260c	Audit: send signal info if selinux is disabled Audit will not respond to signal requests if selinux is disabled since it is unable to translate the 0 sid from the sending process to a context. This patch just doesn't send the context info if there isn't any. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 03:50:26 -04:00
Eric Paris	44e51a1b78	Audit: rearrange audit_context to save 16 bytes per struct pahole pointed out that on x86_64 struct audit_context can be rearrainged to save 16 bytes per struct. Since we have an audit_context per task this can acually be a pretty significant gain. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 03:50:26 -04:00
Eric Paris	e08b061ec0	Audit: reorganize struct audit_watch to save 8 bytes pahole showed that struct audit_watch had two holes: struct audit_watch { atomic_t count; /* 0 4 / / XXX 4 bytes hole, try to pack / char path; /* 8 8 / dev_t dev; / 16 4 / / XXX 4 bytes hole, try to pack / long unsigned int ino; / 24 8 / struct audit_parent parent; /* 32 8 / struct list_head wlist; / 40 16 / struct list_head rules; / 56 16 / / --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- / / size: 72, cachelines: 2, members: 7 / / sum members: 64, holes: 2, sum holes: 8 / / last cacheline: 8 bytes / }; / definitions: 1 */ by moving dev after count we save 8 bytes, actually improving cacheline usage. There are typically very few of these in the kernel so it won't be a large savings, but it's a good thing no matter what. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 03:50:25 -04:00
Linus Torvalds	94a8d5caba	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: (39 commits) cpumask: Move deprecated functions to end of header. cpumask: remove unused deprecated functions, avoid accusations of insanity cpumask: use new-style cpumask ops in mm/quicklist. cpumask: use mm_cpumask() wrapper: x86 cpumask: use mm_cpumask() wrapper: um cpumask: use mm_cpumask() wrapper: mips cpumask: use mm_cpumask() wrapper: mn10300 cpumask: use mm_cpumask() wrapper: m32r cpumask: use mm_cpumask() wrapper: arm cpumask: Use accessors for cpu__mask: um cpumask: Use accessors for cpu__mask: powerpc cpumask: Use accessors for cpu__mask: mips cpumask: Use accessors for cpu__mask: m32r cpumask: remove arch_send_call_function_ipi cpumask: arch_send_call_function_ipi_mask: s390 cpumask: arch_send_call_function_ipi_mask: powerpc cpumask: arch_send_call_function_ipi_mask: mips cpumask: arch_send_call_function_ipi_mask: m32r cpumask: arch_send_call_function_ipi_mask: alpha cpumask: remove obsolete topology_core_siblings and topology_thread_siblings: ia64 ...	2009-09-23 18:14:11 -07:00
Alexey Dobriyan	2bcd57ab61	headers: utsname.h redux * remove asm/atomic.h inclusion from linux/utsname.h -- not needed after kref conversion * remove linux/utsname.h inclusion from files which do not need it NOTE: it looks like fs/binfmt_elf.c do not need utsname.h, however due to some personality stuff it _is_ needed -- cowardly leave ELF-related headers and files alone. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 18:13:10 -07:00
Sebastian Andrzej Siewior	95e0d86bad	Revert "kmod: fix race in usermodehelper code" This reverts commit `c02e3f361c` ("kmod: fix race in usermodehelper code") The patch is wrong. UMH_WAIT_EXEC is called with VFORK what ensures that the child finishes prior returing back to the parent. No race. In fact, the patch makes it even worse because it does the thing it claims not do: - It calls ->complete() on UMH_WAIT_EXEC - the complete() callback may de-allocated subinfo as seen in the following call chain: [<c009f904>] (__link_path_walk+0x20/0xeb4) from [<c00a094c>] (path_walk+0x48/0x94) [<c00a094c>] (path_walk+0x48/0x94) from [<c00a0a34>] (do_path_lookup+0x24/0x4c) [<c00a0a34>] (do_path_lookup+0x24/0x4c) from [<c00a158c>] (do_filp_open+0xa4/0x83c) [<c00a158c>] (do_filp_open+0xa4/0x83c) from [<c009ba90>] (open_exec+0x24/0xe0) [<c009ba90>] (open_exec+0x24/0xe0) from [<c009bfa8>] (do_execve+0x7c/0x2e4) [<c009bfa8>] (do_execve+0x7c/0x2e4) from [<c0026a80>] (kernel_execve+0x34/0x80) [<c0026a80>] (kernel_execve+0x34/0x80) from [<c004b514>] (____call_usermodehelper+0x130/0x148) [<c004b514>] (____call_usermodehelper+0x130/0x148) from [<c0024858>] (kernel_thread_exit+0x0/0x8) and the path pointer was NULL. Good that ARM's kernel_execve() doesn't check the pointer for NULL or else I wouldn't notice it. The only race there might be is with UMH_NO_WAIT but it is too late for me to investigate it now. UMH_WAIT_PROC could probably also use VFORK and we could save one exec. So the only race I see is with UMH_NO_WAIT and recent scheduler changes where the child does not always run first might have trigger here something but as I said, it is late.... Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 18:12:10 -07:00
Rusty Russell	0748bd0177	cpumask: remove arch_send_call_function_ipi Now everyone is converted to arch_send_call_function_ipi_mask, remove the shim and the #defines. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-09-24 09:34:47 +09:30
Li Zefan	79f5599772	cpumask: use zalloc_cpumask_var() where possible Remove open-coded zalloc_cpumask_var() and zalloc_cpumask_var_node(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-09-24 09:34:24 +09:30
Linus Torvalds	c82ffab9a8	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: SELinux: do not destroy the avc_cache_nodep KEYS: Have the garbage collector set its timer for live expired keys tpm-fixup-pcrs-sysfs-file-update creds_are_invalid() needs to be exported for use by modules: include/linux/cred.h: fix build Fix trivial BUILD_BUG_ON-induced conflicts in drivers/char/tpm/tpm.c	2009-09-23 15:18:57 -07:00
Frederic Weisbecker	d7a4b414ee	Merge commit 'linus/master' into tracing/kprobes Conflicts: kernel/trace/Makefile kernel/trace/trace.h kernel/trace/trace_event_types.h kernel/trace/trace_export.c Merge reason: Sync with latest significant tracing core changes.	2009-09-23 23:08:43 +02:00
Randy Dunlap	764db03fee	creds_are_invalid() needs to be exported for use by modules: ERROR: "creds_are_invalid" [fs/cachefiles/cachefiles.ko] undefined! Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: James Morris <jmorris@namei.org>	2009-09-23 11:02:26 -07:00
Andrew Morton	74908a0009	include/linux/cred.h: fix build mips allmodconfig: include/linux/cred.h: In function `creds_are_invalid': include/linux/cred.h:187: error: `PAGE_SIZE' undeclared (first use in this function) include/linux/cred.h:187: error: (Each undeclared identifier is reported only once include/linux/cred.h:187: error: for each function it appears in.) Fixes commit `b6dff3ec5e` Author: David Howells <dhowells@redhat.com> AuthorDate: Fri Nov 14 10:39:16 2008 +1100 Commit: James Morris <jmorris@namei.org> CommitDate: Fri Nov 14 10:39:16 2008 +1100 CRED: Separate task security context from task_struct I think. It's way too large to be inlined anyway. Dunno if this needs an EXPORT_SYMBOL() yet. Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-09-23 11:01:25 -07:00
Paul E. McKenney	9b2619aff0	rcu: Clean up code to address Ingo's checkpatch feedback Move declarations and update storage classes to make checkpatch happy. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12537246441701-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-23 19:46:30 +02:00
Paul E. McKenney	1eba8f8438	rcu: Clean up code based on review feedback from Josh Triplett, part 2 These issues identified during an old-fashioned face-to-face code review extending over many hours. o Add comments for tricky parts of code, and correct comments that have passed their sell-by date. o Get rid of the vestiges of rcu_init_sched(), which is no longer needed now that PREEMPT_RCU is gone. o Move the #include of rcutree_plugin.h to the end of rcutree.c, which means that, rather than having a random collection of forward declarations, the new set of forward declarations document the set of plugins. The new home for this #include also allows __rcu_init_preempt() to move into rcutree_plugin.h. o Fix rcu_preempt_check_callbacks() to be static. Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12537246443924-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu> Peter Zijlstra <peterz@infradead.org>	2009-09-23 19:46:29 +02:00
Paul E. McKenney	fc2219d49e	rcu: Clean up code based on review feedback from Josh Triplett These issues identified during an old-fashioned face-to-face code review extended over many hours. o Bury various forms of the "rsp->completed == rsp->gpnum" comparison into an rcu_gp_in_progress() function, which has the beneficial side-effect of forcing consistent use of ACCESS_ONCE(). o Replace hand-coded arithmetic with DIV_ROUND_UP(). o Bury several "!list_empty(&rnp->blocked_tasks[rnp->gpnum & 0x01])" instances into an rcu_preempted_readers() function, as this expression indicates that there are no readers blocked within RCU read-side critical sections blocking the current grace period. (Though there might well be similar readers blocking the next grace period.) o Remove a dangling rcu_restart_cpu() declaration that has been dangling for almost 20 minor releases of the kernel. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12537246442687-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-23 19:46:29 +02:00
Linus Torvalds	31bbb9b58d	Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: itimers: Add tracepoints for itimer hrtimer: Add tracepoint for hrtimers timers: Add tracepoints for timer_list timers cputime: Optimize jiffies_to_cputime(1) itimers: Simplify arm_timer() code a bit itimers: Fix periodic tics precision itimers: Merge ITIMER_VIRT and ITIMER_PROF Trivial header file include conflicts in kernel/fork.c	2009-09-23 09:46:15 -07:00
KAMEZAWA Hiroyuki	908eedc616	walk system ram range Originally, walk_memory_resource() was introduced to traverse all memory of "System RAM" for detecting memory hotplug/unplug range. For doing so, flags of IORESOUCE_MEM\|IORESOURCE_BUSY was used and this was enough for memory hotplug. But for using other purpose, /proc/kcore, this may includes some firmware area marked as IORESOURCE_BUSY \| IORESOUCE_MEM. This patch makes the check strict to find out busy "System RAM". Note: PPC64 keeps their own walk_memory_resouce(), which walk through ppc64's lmb informaton. Because old kclist_add() is called per lmb, this patch makes no difference in behavior, finally. And this patch removes CONFIG_MEMORY_HOTPLUG check from this function. Because pfn_valid() just show "there is memmap or not* and cannot be used for "there is physical memory or not", this function is useful in generic to scan physical memory range. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Américo Wang <xiyou.wangcong@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Roland Dreier <rolandd@cisco.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:41 -07:00
Stefani Seibold	d899bf7b55	procfs: provide stack information for threads A patch to give a better overview of the userland application stack usage, especially for embedded linux. Currently you are only able to dump the main process/thread stack usage which is showed in /proc/pid/status by the "VmStk" Value. But you get no information about the consumed stack memory of the the threads. There is an enhancement in the /proc/<pid>/{task/,}/maps and which marks the vm mapping where the thread stack pointer reside with "[thread stack xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value information, because libpthread doesn't set the start of the stack to the top of the mapped area, depending of the pthread usage. A sample output of /proc/<pid>/task/<tid>/maps looks like: 08048000-08049000 r-xp 00000000 03:00 8312 /opt/z 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z 0804a000-0806b000 rw-p 00000000 00:00 0 [heap] a7d12000-a7d13000 ---p 00000000 00:00 0 a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4] a7f13000-a7f14000 ---p 00000000 00:00 0 a7f14000-a7f36000 rw-p 00000000 00:00 0 a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6 a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6 a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6 a806c000-a806f000 rw-p 00000000 00:00 0 a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 a8085000-a8088000 rw-p 00000000 00:00 0 a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack] ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] Also there is a new entry "stack usage" in /proc/<pid>/{task/,}/status which will you give the current stack usage in kb. A sample output of /proc/self/status looks like: Name: cat State: R (running) Tgid: 507 Pid: 507 . . . CapBnd: fffffffffffffeff voluntary_ctxt_switches: 0 nonvoluntary_ctxt_switches: 0 Stack usage: 12 kB I also fixed stack base address in /proc/<pid>/{task/,}/stat to the base address of the associated thread stack and not the one of the main process. This makes more sense. [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()] Signed-off-by: Stefani Seibold <stefani@seibold.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:41 -07:00
Mike Frysinger	2a9ad18deb	lockdep: use new arch_is_kernel_data() This allows lockdep to locate symbols that are in arch-specific data sections (such as data in Blackfin on-chip SRAM regions). Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Robin Getz <rgetz@blackfin.uclinux.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:30 -07:00
Mike Frysinger	128e8db38e	kallsyms: use new arch_is_kernel_text() This allows kallsyms to locate symbols that are in arch-specific text sections (such as text in Blackfin on-chip SRAM regions). Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Robin Getz <rgetz@blackfin.uclinux.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:30 -07:00
Jiri Pirko	1f10206cf8	getrusage: fill ru_maxrss value Make ->ru_maxrss value in struct rusage filled accordingly to rss hiwater mark. This struct is filled as a parameter to getrusage syscall. ->ru_maxrss value is set to KBs which is the way it is done in BSD systems. /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs which seems to be incorrect behavior. Maintainer of this util was notified by me with the patch which corrects it and cc'ed. To make this happen we extend struct signal_struct by two fields. The first one is ->maxrss which we use to store rss hiwater of the task. The second one is ->cmaxrss which we use to store highest rss hiwater of all task childs. These values are used in k_getrusage() to actually fill ->ru_maxrss. k_getrusage() uses current rss hiwater value directly if mm struct exists. Note: exec() clear mm->hiwater_rss, but doesn't clear sig->maxrss. it is intetionally behavior. BSD getrusage have exec() inheriting. test programs ======================================================== getrusage.c =========== #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <signal.h> #include <sys/mman.h> #include "common.h" #define err(str) perror(str), exit(1) int main(int argc, char* argv) { int status; printf("allocate 100MB\n"); consume(100); printf("testcase1: fork inherit? \n"); printf(" expect: initial.self ~= child.self\n"); show_rusage("initial"); if (__fork()) { wait(&status); } else { show_rusage("fork child"); _exit(0); } printf("\n"); printf("testcase2: fork inherit? (cont.) \n"); printf(" expect: initial.children ~= 100MB, but child.children = 0\n"); show_rusage("initial"); if (__fork()) { wait(&status); } else { show_rusage("child"); _exit(0); } printf("\n"); printf("testcase3: fork + malloc \n"); printf(" expect: child.self ~= initial.self + 50MB\n"); show_rusage("initial"); if (__fork()) { wait(&status); } else { printf("allocate +50MB\n"); consume(50); show_rusage("fork child"); _exit(0); } printf("\n"); printf("testcase4: grandchild maxrss\n"); printf(" expect: post_wait.children ~= 300MB\n"); show_rusage("initial"); if (__fork()) { wait(&status); show_rusage("post_wait"); } else { system("./child -n 0 -g 300"); _exit(0); } printf("\n"); printf("testcase5: zombie\n"); printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n"); printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n"); show_rusage("initial"); if (__fork()) { sleep(1); /* children become zombie / show_rusage("pre_wait"); wait(&status); show_rusage("post_wait"); } else { system("./child -n 400"); _exit(0); } printf("\n"); printf("testcase6: SIG_IGN\n"); printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n"); show_rusage("initial"); signal(SIGCHLD, SIG_IGN); if (__fork()) { sleep(1); / children become zombie / show_rusage("after_zombie"); } else { system("./child -n 500"); _exit(0); } printf("\n"); signal(SIGCHLD, SIG_DFL); printf("testcase7: exec (without fork) \n"); printf(" expect: initial ~= exec \n"); show_rusage("initial"); execl("./child", "child", "-v", NULL); return 0; } child.c ======= #include <sys/types.h> #include <unistd.h> #include <sys/types.h> #include <sys/wait.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> #include "common.h" int main(int argc, char* argv) { int status; int c; long consume_size = 0; long grandchild_consume_size = 0; int show = 0; while ((c = getopt(argc, argv, "n:g:v")) != -1) { switch (c) { case 'n': consume_size = atol(optarg); break; case 'v': show = 1; break; case 'g': grandchild_consume_size = atol(optarg); break; default: break; } } if (show) show_rusage("exec"); if (consume_size) { printf("child alloc %ldMB\n", consume_size); consume(consume_size); } if (grandchild_consume_size) { if (fork()) { wait(&status); } else { printf("grandchild alloc %ldMB\n", grandchild_consume_size); consume(grandchild_consume_size); exit(0); } } return 0; } common.c ======== #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <signal.h> #include <sys/mman.h> #include "common.h" #define err(str) perror(str), exit(1) void show_rusage(char prefix) { int err, err2; struct rusage rusage_self; struct rusage rusage_children; printf("%s: ", prefix); err = getrusage(RUSAGE_SELF, &rusage_self); if (!err) printf("self %ld ", rusage_self.ru_maxrss); err2 = getrusage(RUSAGE_CHILDREN, &rusage_children); if (!err2) printf("children %ld ", rusage_children.ru_maxrss); printf("\n"); } / Some buggy OS need this worthless CPU waste. / void make_pagefault(void) { void addr; int size = getpagesize(); int i; for (i=0; i<1000; i++) { addr = mmap(NULL, size, PROT_READ \| PROT_WRITE, MAP_PRIVATE \| MAP_ANON, -1, 0); if (addr == MAP_FAILED) err("make_pagefault"); memset(addr, 0, size); munmap(addr, size); } } void consume(int mega) { size_t sz = mega * 1024 * 1024; void ptr; ptr = malloc(sz); memset(ptr, 0, sz); make_pagefault(); } pid_t __fork(void) { pid_t pid; pid = fork(); make_pagefault(); return pid; } common.h ======== void show_rusage(char prefix); void make_pagefault(void); void consume(int mega); pid_t __fork(void); FreeBSD result (expected result) ======================================================== allocate 100MB testcase1: fork inherit? expect: initial.self ~= child.self initial: self 103492 children 0 fork child: self 103540 children 0 testcase2: fork inherit? (cont.) expect: initial.children ~= 100MB, but child.children = 0 initial: self 103540 children 103540 child: self 103564 children 0 testcase3: fork + malloc expect: child.self ~= initial.self + 50MB initial: self 103564 children 103564 allocate +50MB fork child: self 154860 children 0 testcase4: grandchild maxrss expect: post_wait.children ~= 300MB initial: self 103564 children 154860 grandchild alloc 300MB post_wait: self 103564 children 308720 testcase5: zombie expect: pre_wait ~= initial, IOW the zombie process is not accounted. post_wait ~= 400MB, IOW wait() collect child's max_rss. initial: self 103564 children 308720 child alloc 400MB pre_wait: self 103564 children 308720 post_wait: self 103564 children 411312 testcase6: SIG_IGN expect: initial ~= after_zombie (child's 500MB alloc should be ignored). initial: self 103564 children 411312 child alloc 500MB after_zombie: self 103624 children 411312 testcase7: exec (without fork) expect: initial ~= exec initial: self 103624 children 411312 exec: self 103624 children 411312 Linux result (actual test result) ======================================================== allocate 100MB testcase1: fork inherit? expect: initial.self ~= child.self initial: self 102848 children 0 fork child: self 102572 children 0 testcase2: fork inherit? (cont.) expect: initial.children ~= 100MB, but child.children = 0 initial: self 102876 children 102644 child: self 102572 children 0 testcase3: fork + malloc expect: child.self ~= initial.self + 50MB initial: self 102876 children 102644 allocate +50MB fork child: self 153804 children 0 testcase4: grandchild maxrss expect: post_wait.children ~= 300MB initial: self 102876 children 153864 grandchild alloc 300MB post_wait: self 102876 children 307536 testcase5: zombie expect: pre_wait ~= initial, IOW the zombie process is not accounted. post_wait ~= 400MB, IOW wait() collect child's max_rss. initial: self 102876 children 307536 child alloc 400MB pre_wait: self 102876 children 307536 post_wait: self 102876 children 410076 testcase6: SIG_IGN expect: initial ~= after_zombie (child's 500MB alloc should be ignored). initial: self 102876 children 410076 child alloc 500MB after_zombie: self 102880 children 410076 testcase7: exec (without fork) expect: initial ~= exec initial: self 102880 children 410076 exec: self 102880 children 410076 Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:30 -07:00
Scott James Remnant	02b51df1b0	proc connector: add event for process becoming session leader The act of a process becoming a session leader is a useful signal to a supervising init daemon such as Upstart. While a daemon will normally do this as part of the process of becoming a daemon, it is rare for its children to do so. When the children do, it is nearly always a sign that the child should be considered detached from the parent and not supervised along with it. The poster-child example is OpenSSH; the per-login children call setsid() so that they may control the pty connected to them. If the primary daemon dies or is restarted, we do not want to consider the per-login children and want to respawn the primary daemon without killing the children. This patch adds a new PROC_SID_EVENT and associated structure to the proc_event event_data union, it arranges for this to be emitted when the special PIDTYPE_SID pid is set. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Scott James Remnant <scott@ubuntu.com> Acked-by: Matt Helsley <matthltc@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Acked-by: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:29 -07:00
James Morris	88e9d34c72	seq_file: constify seq_operations Make all seq_operations structs const, to help mitigate against revectoring user-triggerable function pointers. This is derived from the grsecurity patch, although generated from scratch because it's simpler than extracting the changes from there. Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:29 -07:00
Xiao Guangrong	54fdade1c3	generic-ipi: make struct call_function_data lockless This patch can remove spinlock from struct call_function_data, the reasons are below: 1: add a new interface for cpumask named cpumask_test_and_clear_cpu(), it can atomically test and clear specific cpu, we can use it instead of cpumask_test_cpu() and cpumask_clear_cpu() and no need data->lock to protect those in generic_smp_call_function_interrupt(). 2: in smp_call_function_many(), after csd_lock() return, the current's cfd_data is deleted from call_function list, so it not have race between other cpus, then cfs_data is only used in smp_call_function_many() that must disable preemption and not from a hardware interrupthandler or from a bottom half handler to call, only the correspond cpu can use it, so it not have race in current cpu, no need cfs_data->lock to protect it. 3: after 1 and 2, cfs_data->lock is only use to protect cfs_data->refs in generic_smp_call_function_interrupt(), so we can define cfs_data->refs to atomic_t, and no need cfs_data->lock any more. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> [akpm@linux-foundation.org: use atomic_dec_return()] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:28 -07:00
Neil Horman	c02e3f361c	kmod: fix race in usermodehelper code The user mode helper code has a race in it. call_usermodehelper_exec() takes an allocated subprocess_info structure, which it passes to a workqueue, and then passes it to a kernel thread which it creates, after which it calls complete to signal to the caller of call_usermodehelper_exec() that it can free the subprocess_info struct. But since we use that structure in the created thread, we can't call complete from __call_usermodehelper(), which is where we create the kernel thread. We need to call complete() from within the kernel thread and then not use subprocess_info afterward in the case of UMH_WAIT_EXEC. Tested successfully by me. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:28 -07:00
Dave Young	af91322ef3	printk: add printk_delay to make messages readable for some scenarios When syslog is not possible, at the same time there's no serial/net console available, it will be hard to read the printk messages. For example oops/panic/warning messages in shutdown phase. Add a printk delay feature, we can make each printk message delay some milliseconds. Setting the delay by proc/sysctl interface: /proc/sys/kernel/printk_delay The value range from 0 - 10000, default value is 0 [akpm@linux-foundation.org: fix a few things] Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:28 -07:00
Dave Young	3a3b6ed223	printk boot_delay: rename printk_delay_msec to loops_per_msec Rename `printk_delay_msec' to `loops_per_msec', because the patch "printk: add printk_delay to make messages readable for some scenarios" wishes to more appropriately use the `printk_delay_msec' identifier. [akpm@linux-foundation.org: add a comment] Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-23 07:39:28 -07:00
Ingo Molnar	115e8a2882	modules, tracing: Remove stale struct marker signature from module_layout() Linus reported this new build warning: kernel/module.c:2951: warning: ?struct marker? declared inside parameter list kernel/module.c:2951: warning: its scope is only this definition or declaration, which is probably not what you want Caused by: `fc53776`: tracing: Remove markers module_layout() is an artificial symbol with 'significant' symbols listed in its argument list so that it gets a proper argument types signature that modversions can pick up to decide whether a module is version-compatible or not. If these dont match then we wont even look at a module. Remove the stale marker symbol. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <alpine.LFD.2.01.0909210908020.4950@localhost.localdomain> Cc: Christoph Hellwig <hch@lst.de> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-23 10:34:21 +02:00
Linus Torvalds	342ff1a1b5	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) trivial: fix typo in aic7xxx comment trivial: fix comment typo in drivers/ata/pata_hpt37x.c trivial: typo in kernel-parameters.txt trivial: fix typo in tracing documentation trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c trivial: remove unnecessary semicolons trivial: Fix duplicated word "options" in comment trivial: kbuild: remove extraneous blank line after declaration of usage() trivial: improve help text for mm debug config options trivial: doc: hpfall: accept disk device to unload as argument trivial: doc: hpfall: reduce risk that hpfall can do harm trivial: SubmittingPatches: Fix reference to renumbered step trivial: fix typos "man[ae]g?ment" -> "management" trivial: media/video/cx88: add __init/__exit macros to cx88 drivers trivial: fix typo in CONFIG_DEBUG_FS in gcov doc trivial: fix missing printk space in amd_k7_smp_check trivial: fix typo s/ketymap/keymap/ in comment trivial: fix typo "to to" in multiple files trivial: fix typos in comments s/DGBU/DBGU/ ...	2009-09-22 07:51:45 -07:00
Ingo Molnar	3fff4c42bd	printk: Remove ratelimit.h from kernel.h Decouple kernel.h from ratelimit.h: the global declaration of printk's ratelimit_state is not needed, and it leads to messy circular dependencies due to ratelimit.h's (new) adding of a spinlock_types.h include. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: David S. Miller <davem@davemloft.net> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-22 16:18:09 +02:00
Arjan van de Ven	69d25870f2	cpuidle: fix the menu governor to boost IO performance Fix the menu idle governor which balances power savings, energy efficiency and performance impact. The reason for a reworked governor is that there have been serious performance issues reported with the existing code on Nehalem server systems. To show this I'm sure Andrew wants to see benchmark results: (benchmark is "fio", "no cstates" is using "idle=poll") no cstates current linux new algorithm 1 disk 107 Mb/s 85 Mb/s 105 Mb/s 2 disks 215 Mb/s 123 Mb/s 209 Mb/s 12 disks 590 Mb/s 320 Mb/s 585 Mb/s In various power benchmark measurements, no degredation was found by our measurement&diagnostics team. Obviously a small percentage more power was used in the "fio" benchmark, due to the much higher performance. While it would be a novel idea to describe the new algorithm in this commit message, I cheaped out and described it in comments in the code instead. [changes since first post: spelling fixes from akpm, review feedback, folded menu-tng into menu.c] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:45 -07:00
Bernd Schmidt	eb8cdec4a9	nommu: add support for Memory Protection Units (MPU) Some architectures (like the Blackfin arch) implement some of the "simpler" features that one would expect out of a MMU such as memory protection. In our case, we actually get read/write/exec protection down to the page boundary so processes can't stomp on each other let alone the kernel. There is a performance decrease (which depends greatly on the workload) however as the hardware/software interaction was not optimized at design time. Signed-off-by: Bernd Schmidt <bernds_cb1@t-online.de> Signed-off-by: Bryan Wu <cooloney@kernel.org> Signed-off-by: Mike Frysinger <vapier@gentoo.org> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:43 -07:00
KOSAKI Motohiro	28b83c5193	oom: move oom_adj value from task_struct to signal_struct Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:39 -07:00
Alexey Dobriyan	1a8670a29b	oom: move oom_killer_enable()/oom_killer_disable to where they belong Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:38 -07:00
Jan Beulich	2c85f51d22	mm: also use alloc_large_system_hash() for the PID hash table This is being done by allowing boot time allocations to specify that they may want a sub-page sized amount of memory. Overall this seems more consistent with the other hash table allocations, and allows making two supposedly mm-only variables really mm-only (nr_{kernel,all}_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:38 -07:00
Jan Beulich	3c1596efe1	mm: don't use alloc_bootmem_low() where not strictly needed Since alloc_bootmem() will never return inaccessible (via virtual addressing) memory anyway, using the ..._low() variant only makes sense when the physical address range of the allocated memory must fulfill further constraints, espacially since on 64-bits (or more generally in all cases where the pools the two variants allocate from are than the full available range. Probably the use in alloc_tce_table() could also be eliminated (based on code inspection of pci-calgary_64.c), but that seems too risky given I know nothing about that hardware and have no way to test it. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:38 -07:00
Andrea Arcangeli	1c2fb7a4c2	ksm: fix deadlock with munlock in exit_mmap Rawhide users have reported hang at startup when cryptsetup is run: the same problem can be simply reproduced by running a program int main() { mlockall(MCL_CURRENT \| MCL_FUTURE); return 0; } The problem is that exit_mmap() applies munlock_vma_pages_all() to clean up VM_LOCKED areas, and its current implementation (stupidly) tries to fault in absent pages, for example where PROT_NONE prevented them being faulted in when mlocking. Whereas the "ksm: fix oom deadlock" patch, knowing there's a race by which KSM might try to fault in pages after exit_mmap() had finally zapped the range, backs out of such faults doing nothing when its ksm_test_exit() notices mm_users 0. So revert that part of "ksm: fix oom deadlock" which moved the ksm_exit() call from before exit_mmap() to the middle of exit_mmap(); and remove those ksm_test_exit() checks from the page fault paths, so allowing the munlocking to proceed without interference. ksm_exit, if there are rmap_items still chained on this mm slot, takes mmap_sem write side: so preventing KSM from working on an mm while exit_mmap runs. And KSM will bail out as soon as it notices that mm_users is already zero, thanks to its internal ksm_test_exit checks. So that when a task is killed by OOM killer or the user, KSM will not indefinitely prevent it from running exit_mmap to release its memory. This does break a part of what "ksm: fix oom deadlock" was trying to achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even when ksmd itself has to cancel a KSM page, it is possible that the first OOM-kill victim would be the KSM process being faulted: then its memory won't be freed until a second victim has been selected (freeing memory for the unmerging fault to complete). But the OOM killer is already liable to kill a second victim once the intended victim's p->mm goes to NULL: so there's not much point in rejecting this KSM patch before fixing that OOM behaviour. It is very much more important to allow KSM users to boot up, than to haggle over an unlikely and poorly supported OOM case. We also intend to fix munlocking to not fault pages: at which point this patch _could_ be reverted; though that would be controversial, so we hope to find a better solution. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Justin M. Forbes <jforbes@redhat.com> Acked-for-now-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Hugh Dickins	9ba6929480	ksm: fix oom deadlock There's a now-obvious deadlock in KSM's out-of-memory handling: imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex, trying to allocate a page to break KSM in an mm which becomes the OOM victim (quite likely in the unmerge case): it's killed and goes to exit, and hangs there waiting to acquire ksm_thread_mutex. Clearly we must not require ksm_thread_mutex in __ksm_exit, simple though that made everything else: perhaps use mmap_sem somehow? And part of the answer lies in the comments on unmerge_ksm_pages: __ksm_exit should also leave all the rmap_item removal to ksmd. But there's a fundamental problem, that KSM relies upon mmap_sem to guarantee the consistency of the mm it's dealing with, yet exit_mmap tears down an mm without taking mmap_sem. And bumping mm_users won't help at all, that just ensures that the pages the OOM killer assumes are on their way to being freed will not be freed. The best answer seems to be, to move the ksm_exit callout from just before exit_mmap, to the middle of exit_mmap: after the mm's pages have been freed (if the mmu_gather is flushed), but before its page tables and vma structures have been freed; and down_write,up_write mmap_sem there to serialize with KSM's own reliance on mmap_sem. But KSM then needs to be careful, whenever it downs mmap_sem, to check that the mm is not already exiting: there's a danger of using find_vma on a layout that's being torn apart, or writing into page tables which have been freed for reuse; and even do_anonymous_page and __do_fault need to check they're not being called by break_ksm to reinstate a pte after zap_pte_range has zapped that page table. Though it might be clearer to add an exiting flag, set while holding mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating a zapped pte. All we need is to check whether mm_users is 0 - but must remember that ksmd may detect that before __ksm_exit is reached. So, ksm_test_exit(mm) added to comment such checks on mm->mm_users. __ksm_exit now has to leave clearing up the rmap_items to ksmd, that needs ksm_thread_mutex; but shift the exiting mm just after the ksm_scan cursor so that it will soon be dealt with. __ksm_enter raise mm_count to hold the mm_struct, ksmd's exit processing (exactly like its processing when it finds all VM_MERGEABLEs unmapped) mmdrop it, similar procedure for KSM_RUN_UNMERGE (which has stopped ksmd). But also give __ksm_exit a fast path: when there's no complication (no rmap_items attached to mm and it's not at the ksm_scan cursor), it can safely do all the exiting work itself. This is not just an optimization: when ksmd is not running, the raised mm_count would otherwise leak mm_structs. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:32 -07:00
Hugh Dickins	f8af4da3b4	ksm: the mm interface to ksm This patch presents the mm interface to a dummy version of ksm.c, for better scrutiny of that interface: the real ksm.c follows later. When CONFIG_KSM is not set, madvise(2) reject MADV_MERGEABLE and MADV_UNMERGEABLE with EINVAL, since that seems more helpful than pretending that they can be serviced. But when CONFIG_KSM=y, accept them even if KSM is not currently running, and even on areas which KSM will not touch (e.g. hugetlb or shared file or special driver mappings). Like other madvices, report ENOMEM despite success if any area in the range is unmapped, and use EAGAIN to report out of memory. Define vma flag VM_MERGEABLE to identify an area on which KSM may try merging pages: leave it to ksm_madvise() to decide whether to set it. Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain VM_MERGEABLE areas, to minimize callouts when forking or exiting. Based upon earlier patches by Chris Wright and Izik Eidus. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Izik Eidus <ieidus@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:31 -07:00
KOSAKI Motohiro	c6a7f5728a	mm: oom analysis: Show kernel stack usage in /proc/meminfo and OOM log output The amount of memory allocated to kernel stacks can become significant and cause OOM conditions. However, we do not display the amount of memory consumed by stacks. Add code to display the amount of memory used for stacks in /proc/meminfo. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:27 -07:00
Alexey Dobriyan	6e1d5dcc2b	const: mark remaining inode_operations as const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:24 -07:00
Alexey Dobriyan	b87221de6a	const: mark remaining super_operations const Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-22 07:17:24 -07:00
Darren Hart	0729e19614	futex: Fix wakeup race by setting TASK_INTERRUPTIBLE before queue_me() PI futexes do not use the same plist_node_empty() test for wakeup. It was possible for the waiter (in futex_wait_requeue_pi()) to set TASK_INTERRUPTIBLE after the waker assigned the rtmutex to the waiter. The waiter would then note the plist was not empty and call schedule(). The task would not be found by any subsequeuent futex wakeups, resulting in a userspace hang. By moving the setting of TASK_INTERRUPTIBLE to before the call to queue_me(), the race with the waker is eliminated. Since we no longer call get_user() from within queue_me(), there is no need to delay the setting of TASK_INTERRUPTIBLE until after the call to queue_me(). The FUTEX_LOCK_PI operation is not affected as futex_lock_pi() relies entirely on the rtmutex code to handle schedule() and wakeup. The requeue PI code is affected because the waiter starts as a non-PI waiter and is woken on a PI futex. Remove the crusty old comment about holding spinlocks() across get_user() as we no longer do that. Correct the locking statement with a description of why the test is performed. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090922053038.8717.97838.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-22 10:37:44 +02:00
Darren Hart	d8d88fbb18	futex: Correct futex_q woken state commentary Use kernel-doc format to describe struct futex_q. Correct the wakeup definition to eliminate the statement about waking the waiter between the plist_del() and the q->lock_ptr = 0. Note in the comment that PI futexes have a different definition of the woken state. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090922053029.8717.62798.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-22 10:37:44 +02:00
Darren Hart	d96ee56ce0	futex: Make function kernel-doc commentary consistent Make the existing function kernel-doc consistent throughout futex.c, following Documentation/kernel-doc-nano-howto.txt as closely as possible. When unsure, at least be consistent within futex.c. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090922053022.8717.13339.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-22 10:37:43 +02:00
Darren Hart	d40d65c8db	futex: Correct queue_me and unqueue_me commentary The queue_me/unqueue_me commentary is oddly placed and out of date. Clean it up and correct the inaccurate bits. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090922053015.8717.71713.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-22 10:37:43 +02:00
Darren Hart	56ec1607b1	futex: Correct futex_wait_requeue_pi() commentary Correct various typos and formatting inconsistencies in the commentary of futex_wait_requeue_pi(). Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090922052958.8717.21932.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-22 10:37:42 +02:00
Li Zefan	79fe249c83	tracing: Fix failure path in ftrace_regex_open() Don't forget to free trace_parser if seq_open() returned failure. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4AB86694.4040803@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-22 10:28:57 +02:00
Li Zefan	1eb90f138b	tracing: Fix failure path in ftrace_graph_write() Don't call trace_parser_put() on uninitialized trace_parser. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4AB86639.3000003@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-22 10:28:56 +02:00
Li Zefan	4ba7978e98	tracing: Check the return value of trace_get_user() Return immediately if trace_get_user() returned failure. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4AB86614.7020803@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-22 10:28:55 +02:00
Li Zefan	3c235a337e	tracing: Fix off-by-one in trace_get_user() Leave the last slot for the tailing '\0'. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4AB865FA.5080801@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-22 10:28:53 +02:00
Andrew Morton	dedcf2971c	net: fix CONFIG_NET=n build on sparc64 sparc64 allnoconfig: arch/sparc/kernel/built-in.o(.text+0x134e0): In function `sys32_recvfrom': : undefined reference to `compat_sys_recvfrom' arch/sparc/kernel/built-in.o(.text+0x134e4): In function `sys32_recvfrom': : undefined reference to `compat_sys_recvfrom' Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-21 11:32:27 -07:00
Linus Torvalds	43c1266ce4	Merge branch 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Tidy up after the big rename perf: Do the big rename: Performance Counters -> Performance Events perf_counter: Rename 'event' to event_id/hw_event perf_counter: Rename list_entry -> group_entry, counter_list -> group_list Manually resolved some fairly trivial conflicts with the tracing tree in include/trace/ftrace.h and kernel/trace/trace_syscalls.c.	2009-09-21 09:15:07 -07:00
Linus Torvalds	b8c7f1dc5c	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: Fix whitespace inconsistencies rcu: Fix thinko, actually initialize full tree rcu: Apply results of code inspection of kernel/rcutree_plugin.h rcu: Add WARN_ON_ONCE() consistency checks covering state transitions rcu: Fix synchronize_rcu() for TREE_PREEMPT_RCU rcu: Simplify rcu_read_unlock_special() quiescent-state accounting rcu: Add debug checks to TREE_PREEMPT_RCU for premature grace periods rcu: Kconfig help needs to say that TREE_PREEMPT_RCU scales down rcutorture: Occasionally delay readers enough to make RCU force_quiescent_state rcu: Initialize multi-level RCU grace periods holding locks rcu: Need to update rnp->gpnum if preemptable RCU is to be reliable	2009-09-21 09:06:52 -07:00
Linus Torvalds	8e4bc3dd2c	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Simplify sys_sched_rr_get_interval() system call sched: Fix potential NULL derference of doms_cur sched: Fix raciness in runqueue_is_locked() sched: Re-add lost cpu_allowed check to sched_fair.c::select_task_rq_fair() sched: Remove unneeded indentation in sched_fair.c::place_entity()	2009-09-21 09:06:17 -07:00
Linus Torvalds	bd4c3a3441	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: kernel/profile.c: Switch /proc/irq/prof_cpu_mask to seq_file tracing: Export trace_profile_buf symbols tracing/events: use list_for_entry_continue tracing: remove max_tracer_type_len function-graph: use ftrace_graph_funcs directly tracing: Remove markers tracing: Allocate the ftrace event profile buffer dynamically tracing: Factorize the events profile accounting	2009-09-21 09:05:47 -07:00
Joe Perches	a419aef8b8	trivial: remove unnecessary semicolons Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-09-21 15:14:58 +02:00
Uwe Kleine-Koenig	2944fcbe03	trivial: Fix duplicated word "options" in comment this was introduced in `5e0a093` (tracing: fix config options to not show when automatically selected) Signed-off-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: trivial@kernel.org Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-09-21 15:14:58 +02:00
Anand Gadiyar	fd589a8f0a	trivial: fix typo "to to" in multiple files Signed-off-by: Anand Gadiyar <gadiyar@ti.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-09-21 15:14:55 +02:00
Robert P. J. Day	fe002a4197	trivial: Correct print_tainted routine name in comment Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-09-21 15:14:53 +02:00
Ingo Molnar	57c0c15b52	perf: Tidy up after the big rename - provide compatibility Kconfig entry for existing PERF_COUNTERS .config's - provide courtesy copy of old perf_counter.h, for user-space projects - small indentation fixups - fix up MAINTAINERS - fix small x86 printout fallout - fix up small PowerPC comment fallout (use 'counter' as in register) Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-21 14:34:11 +02:00
Michal Simek	1f74b1f7e5	microblaze: Enable GCOV_PROFILE_ALL Signed-off-by: Michal Simek <monstr@monstr.eu>	2009-09-21 14:29:21 +02:00
Ingo Molnar	cdd6c482c9	perf: Do the big rename: Performance Counters -> Performance Events Bye-bye Performance Counters, welcome Performance Events! In the past few months the perfcounters subsystem has grown out its initial role of counting hardware events, and has become (and is becoming) a much broader generic event enumeration, reporting, logging, monitoring, analysis facility. Naming its core object 'perf_counter' and naming the subsystem 'perfcounters' has become more and more of a misnomer. With pending code like hw-breakpoints support the 'counter' name is less and less appropriate. All in one, we've decided to rename the subsystem to 'performance events' and to propagate this rename through all fields, variables and API names. (in an ABI compatible fashion) The word 'event' is also a bit shorter than 'counter' - which makes it slightly more convenient to write/handle as well. Thanks goes to Stephane Eranian who first observed this misnomer and suggested a rename. User-space tooling and ABI compatibility is not affected - this patch should be function-invariant. (Also, defconfigs were not touched to keep the size down.) This patch has been generated via the following script: FILES=$(find * -type f \| grep -vE 'oprofile\|[^K]config') sed -i \ -e 's/PERF_EVENT_/PERF_RECORD_/g' \ -e 's/PERF_COUNTER/PERF_EVENT/g' \ -e 's/perf_counter/perf_event/g' \ -e 's/nb_counters/nb_events/g' \ -e 's/swcounter/swevent/g' \ -e 's/tpcounter_event/tp_event/g' \ $FILES for N in $(find . -name perf_counter.[ch]); do M=$(echo $N \| sed 's/perf_counter/perf_event/g') mv $N $M done FILES=$(find . -name perf_event.*) sed -i \ -e 's/COUNTER_MASK/REG_MASK/g' \ -e 's/COUNTER/EVENT/g' \ -e 's/\<event\>/event_id/g' \ -e 's/counter/event/g' \ -e 's/Counter/Event/g' \ $FILES ... to keep it as correct as possible. This script can also be used by anyone who has pending perfcounters patches - it converts a Linux kernel tree over to the new naming. We tried to time this change to the point in time where the amount of pending patches is the smallest: the end of the merge window. Namespace clashes were fixed up in a preparatory patch - and some stylistic fallout will be fixed up in a subsequent patch. ( NOTE: 'counters' are still the proper terminology when we deal with hardware registers - and these sed scripts are a bit over-eager in renaming them. I've undone some of that, but in case there's something left where 'counter' would be better than 'event' we can undo that on an individual basis instead of touching an otherwise nicely automated patch. ) Suggested-by: Stephane Eranian <eranian@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <linux-arch@vger.kernel.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-21 14:28:04 +02:00
Ingo Molnar	dfc65094d0	perf_counter: Rename 'event' to event_id/hw_event In preparation to the renames, to avoid a namespace clash. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-21 12:54:59 +02:00
Ingo Molnar	65abc8653c	perf_counter: Rename list_entry -> group_entry, counter_list -> group_list This is in preparation of the big rename, but also makes sense in a standalone way: 'list_entry' is a bad name as we already have a list_entry() in list.h. Also, the 'counter list' is too vague, it doesnt tell us the purpose of that list. Clarify these names to show that it's all about the group hiearchy. Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-21 12:54:51 +02:00
Heiko Carstens	d01d482785	sched: Always show Cpus_allowed field in /proc/<pid>/status The Cpus_allowed fields in /proc/<pid>/status is currently only shown in case of CONFIG_CPUSETS. However their contents are also useful for the !CONFIG_CPUSETS case. So change the current behaviour and always show these fields. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20090921090627.GD4649@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-21 11:37:27 +02:00
Peter Williams	0d721ceadb	sched: Simplify sys_sched_rr_get_interval() system call By removing the need for it to know details of scheduling classes. This allows PlugSched to define orthogonal scheduling classes. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <06d1b89ee15a0eef82d7.1253496713@mudlark.pw.nest> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-21 09:53:55 +02:00
Linus Torvalds	ebc79c4f8d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jaswinder/linux-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jaswinder/linux-2.6: includecheck fix: x86, cpu/common.c includecheck fix: kernel/trace, ring_buffer.c includecheck fix: include/linux, ftrace.h includecheck fix: include/linux, page_cgroup.h includecheck fix: include/linux, aio.h includecheck fix: include/drm, drm_memory.h includecheck fix: include/acpi, acpi_bus.h includecheck fix: drivers/xen, evtchn.c includecheck fix: drivers/video, vgacon.c includecheck fix: drivers/scsi, ibmvscsi.c includecheck fix: drivers/scsi, libfcoe.c includecheck fix: x86, shadow.c includecheck fix: x86, traps.c includecheck fix: um, helper.c includecheck fix: s390, sys_s390.c	2009-09-20 16:02:06 -07:00
Linus Torvalds	e11c675ede	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (79 commits) USB serial: update the console driver usb-serial: straighten out serial_open usb-serial: add missing tests and debug lines usb-serial: rename subroutines usb-serial: fix termios initialization logic usb-serial: acquire references when a new tty is installed usb-serial: change logic of serial lookups usb-serial: put subroutines in logical order usb-serial: change referencing of port and serial structures tty: Char: mxser, use THRE for ASPP_OQUEUE ioctl tty: Char: mxser, add support for CP112UL uartlite: support shared interrupt lines tty: USB: serial/mct_u232, fix tty refcnt tty: riscom8, fix tty refcnt tty: riscom8, fix shutdown declaration TTY: fix typos tty: Power: fix suspend vt regression tty: vt: use printk_once tty: handle VT specific compat ioctls in vt driver n_tty: move echoctl check and clean up logic ...	2009-09-20 15:55:17 -07:00
Linus Torvalds	467f9957d9	Merge branch 'perfcounters-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (58 commits) perf_counter: Fix perf_copy_attr() pointer arithmetic perf utils: Use a define for the maximum length of a trace event perf: Add timechart help text and add timechart to "perf help" tracing, x86, cpuidle: Move the end point of a C state in the power tracer perf utils: Be consistent about minimum text size in the svghelper perf timechart: Add "perf timechart record" perf: Add the timechart tool perf: Add a SVG helper library file tracing, perf: Convert the power tracer into an event tracer perf: Add a sample_event type to the event_union perf: Allow perf utilities to have "callback" options without arguments perf: Store trace event name/id pairs in perf.data perf: Add a timestamp to fork events sched_clock: Make it NMI safe perf_counter: Fix up swcounter throttling x86, perf_counter, bts: Optimize BTS overflow handling perf sched: Add --input=file option to builtin-sched.c perf trace: Sample timestamp and cpu when using record flag perf tools: Increase MAX_EVENT_LENGTH perf tools: Fix memory leak in read_ftrace_printk() ...	2009-09-20 15:54:37 -07:00
Yong Zhang	cb5fd13f11	sched: Fix potential NULL derference of doms_cur If CONFIG_CPUMASK_OFFSTACK is enabled but doms_cur alloc failed in arch_init_sched_domains(), doms_cur will move back to fallback_doms. But this time, fallback_doms has not been initialized yet. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: a.p.zijlstra@chello.nl LKML-Reference: <1252930816-7672-1-git-send-email-yong.zhang0@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-20 20:20:30 +02:00
Alexey Dobriyan	583a22e7c1	kernel/profile.c: Switch /proc/irq/prof_cpu_mask to seq_file Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-20 20:15:40 +02:00
Andrew Morton	89f19f04dc	sched: Fix raciness in runqueue_is_locked() runqueue_is_locked() is unavoidably racy due to a poor interface design. It does cpu = get_cpu() ret = some_perpcu_thing(cpu); put_cpu(cpu); return ret; Its return value is unreliable. Fix. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <200909191855.n8JItiko022148@imap1.linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-09-20 20:00:32 +02:00

... 19 20 21 22 23 ...

10270 Commits