linux

Author	SHA1	Message	Date
Ingo Molnar	d7619fe39d	Merge branch 'linus' into core/urgent	2011-08-04 09:09:27 +02:00
Oleg Nesterov	a7295898a1	taskstats: add_del_listener() should ignore !valid listeners When send_cpu_listeners() finds the orphaned listener it marks it as !valid and drops listeners->sem. Before it takes this sem for writing, s->pid can be reused and add_del_listener() can wrongly try to re-use this entry. Change add_del_listener() to check ->valid = T. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Vasiliy Kulikov <segoon@openwall.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-03 14:25:20 -10:00
Oleg Nesterov	dfc428b656	taskstats: add_del_listener() shouldn't use the wrong node 1. Commit `26c4caea9d` "don't allow duplicate entries in listener mode" changed add_del_listener(REGISTER) so that "next_cpu:" can reuse the listener allocated for the previous cpu, this doesn't look exactly right even if minor. Change the code to kfree() in the already-registered case, this case is unlikely anyway so the extra kmalloc_node() shouldn't hurt but looks more correct and clean. 2. use the plain list_for_each_entry() instead of _safe() to scan listeners->list. 3. Remove the unneeded INIT_LIST_HEAD(&s->list), we are going to list_add(&s->list). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-03 14:25:20 -10:00
Linus Torvalds	72f9adfd20	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kdb,kgdb: Allow arbitrary kgdb magic knock sequences kdb: Remove all references to DOING_KGDB2 kdb,kgdb: Implement switch and pass buffer from kdb -> gdb kdb: cleanup unused variables missed in the original kdb merge	2011-08-01 13:39:40 -10:00
Jason Wessel	37f86b469d	kdb,kgdb: Allow arbitrary kgdb magic knock sequences The first packet that gdb sends when the kernel is in kdb mode seems to change with every release of gdb. Instead of continuing to add many different gdb packets, change kdb to automatically look for any thing that looks like a gdb packet. Example 1 cold start test: echo g > /proc/sysrq-trigger $D#44+ Example 2 cold start test: echo g > /proc/sysrq-trigger $3#33 The second one should re-enter kdb's shell right away and is purely a test. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2011-08-01 13:23:59 -05:00
Jason Wessel	d613d828e8	kdb: Remove all references to DOING_KGDB2 The DOING_KGDB2 was originally a state variable for one of the two ways to automatically transition from kdb to kgdb. Purge all these variables and just use one single state for the transition. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2011-08-01 13:23:59 -05:00
Jason Wessel	f679c4985b	kdb,kgdb: Implement switch and pass buffer from kdb -> gdb When switching from kdb mode to kgdb mode packets were getting lost depending on the size of the fifo queue of the serial chip. When gdb initially connects if it is in kdb mode it should entirely send any character buffer over to the gdbstub when switching connections. Previously kdb was zero'ing out the character buffer and this could lead to gdb failing to connect at all, or a lengthy pause could occur on the initial connect. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2011-08-01 13:23:59 -05:00
Jason Wessel	3bdb65ec95	kdb: cleanup unused variables missed in the original kdb merge The BTARGS and BTSYMARG variables do not have any function in the mainline version of kdb. Reported-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2011-08-01 13:23:58 -05:00
Linus Torvalds	968e75fc13	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k: m68k/math-emu: Remove unnecessary code m68k/math-emu: Remove commented out old code m68k: Kill warning in setup_arch() when compiling for Sun3 m68k/atari: Prefix GPIO_{IN,OUT} with CODEC_ sparc: iounmap() and _free_coherent() - Use lookup_resource() m68k/atari: Reserve some ST-RAM early on for device buffer use m68k/amiga: Chip RAM - Use lookup_resource() resources: Add lookup_resource() sparc: _sparc_find_resource() should check for exact matches m68k/amiga: Chip RAM - Offset resource end by CHIP_PHYSADDR m68k/amiga: Chip RAM - Use resource_size() to fix off-by-one error m68k/amiga: Chip RAM - Change chipavail to an atomic_t m68k/amiga: Chip RAM - Always allocate from the start of memory m68k/amiga: Chip RAM - Convert from printk() to pr_() m68k/amiga: Chip RAM - Use tabs for indentation	2011-07-31 14:30:59 -10:00
Geert Uytterhoeven	1c388919d8	resources: Add lookup_resource() Add a function to find an existing resource by a resource start address. This allows to implement simple allocators (with a malloc/free-alike API) on top of the resource system. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>	2011-07-30 21:21:39 +02:00
Linus Torvalds	664a41b8a9	Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6 * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6: (430 commits) [media] ir-mce_kbd-decoder: include module.h for its facilities [media] ov5642: include module.h for its facilities [media] em28xx: Fix DVB-C maxsize for em2884 [media] tda18271c2dd: Fix saw filter configuration for DVB-C @6MHz [media] v4l: mt9v032: Fix Bayer pattern [media] V4L: mt9m111: rewrite set_pixfmt [media] V4L: mt9m111: fix missing return value check mt9m111_reg_clear [media] V4L: initial driver for ov5642 CMOS sensor [media] V4L: sh_mobile_ceu_camera: fix Oops when USERPTR mapping fails [media] V4L: soc-camera: remove soc-camera bus and devices on it [media] V4L: soc-camera: un-export the soc-camera bus [media] V4L: sh_mobile_csi2: switch away from using the soc-camera bus notifier [media] V4L: add media bus configuration subdev operations [media] V4L: soc-camera: group struct field initialisations together [media] V4L: soc-camera: remove now unused soc-camera specific PM hooks [media] V4L: pxa-camera: switch to using standard PM hooks [media] NetUP Dual DVB-T/C CI RF: force card hardware revision by module param [media] Don't OOPS if videobuf_dvb_get_frontend return NULL [media] NetUP Dual DVB-T/C CI RF: load firmware according card revision [media] omap3isp: Support configurable HS/VS polarities ... Fix up conflicts: - arch/arm/mach-omap2/board-rx51-peripherals.c: cleanup regulator supply definitions in mach-omap2 vs OMAP3: RX-51: define vdds_csib regulator supply - drivers/staging/tm6000/tm6000-alsa.c (trivial)	2011-07-30 00:08:53 -07:00
Linus Torvalds	cb7dee8d22	Merge branch 'next/dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc * 'next/dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc: (21 commits) arm/dt: tegra devicetree support arm/versatile: Add device tree support dt/irq: add irq_domain_generate_simple() helper irq: add irq_domain translation infrastructure dmaengine: imx-sdma: add device tree probe support dmaengine: imx-sdma: sdma_get_firmware does not need to copy fw_name dmaengine: imx-sdma: use platform_device_id to identify sdma version mmc: sdhci-esdhc-imx: add device tree probe support mmc: sdhci-pltfm: dt device does not pass parent to sdhci_alloc_host mmc: sdhci-esdhc-imx: get rid of the uses of cpu_is_mx() mmc: sdhci-esdhc-imx: do not reference platform data after probe mmc: sdhci-esdhc-imx: extend card_detect and write_protect support for mx5 net/fec: add device tree probe support net: ibm_newemac: convert it to use of_get_phy_mode dt/net: add helper function of_get_phy_mode net/fec: gasket needs to be enabled for some i.mx serial/imx: add device tree probe support serial/imx: get rid of the uses of cpu_is_mx1() arm/dt: Add dtb make rule arm/dt: Add skeleton dtsi file ...	2011-07-29 23:32:02 -07:00
Arnd Bergmann	6124a4e430	Merge branch 'imx/dt' into next/dt	2011-07-28 15:25:46 +00:00
Sebastian Andrzej Siewior	b6873807a7	irq: Track the owner of irq descriptor Interrupt descriptors can be allocated from modules. The interrupts are used by other modules, but we have no refcount on the module which provides the interrupts and there is no way to establish one on the device level as the interrupt using module is agnostic to the fact that the interrupt is provided by a module rather than by some builtin interrupt controller. To prevent removal of the interrupt providing module, we can track the owner of the interrupt descriptor, which also provides the relevant irq chip functions in the irq descriptor. request/setup_irq() can now acquire a refcount on the owner module to prevent unloading. free_irq() drops the refcount. Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Link: http://lkml.kernel.org/r/20110711101731.GA13804@Chamillionaire.breakpoint.cc Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-07-28 11:23:21 +02:00
Sebastian Andrzej Siewior	f3637a5f2e	irq: Always set IRQF_ONESHOT if no primary handler is specified If no primary handler is specified then a default one is assigned which always returns IRQ_WAKE_THREAD. This handler requires the IRQF_ONESHOT flag on LEVEL / EOI typed irqs because the source of interrupt is not disabled. Since it is required for those users and there is no difference for others it makes sense to add this flag unconditionally. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: http://lkml.kernel.org/r/1310070737-18514-1-git-send-email-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-07-28 11:23:21 +02:00
Grant Likely	7e71330169	dt/irq: add irq_domain_generate_simple() helper irq_domain_generate_simple() is an easy way to generate an irq translation domain for simple irq controllers. It assumes a flat 1:1 mapping from hardware irq number to an offset of the first linux irq number assigned to the controller Signed-off-by: Grant Likely <grant.likely@secretlab.ca>	2011-07-28 01:32:04 -06:00
Grant Likely	08a543ad33	irq: add irq_domain translation infrastructure This patch adds irq_domain infrastructure for translating from hardware irq numbers to linux irqs. This is particularly important for architectures adding device tree support because the current implementation (excluding PowerPC and SPARC) cannot handle translation for more than a single interrupt controller. irq_domain supports device tree translation for any number of interrupt controllers. This patch converts x86, Microblaze, ARM and MIPS to use irq_domain for device tree irq translation. x86 is untested beyond compiling it, irq_domain is enabled for MIPS and Microblaze, but the old behaviour is preserved until the core code is modified to actually register an irq_domain yet. On ARM it works and is required for much of the new ARM device tree board support. PowerPC has /not/ been converted to use this new infrastructure. It is still missing some features before it can replace the virq infrastructure already in powerpc (see documentation on irq_domain_map/unmap for details). Followup patches will add the missing pieces and migrate PowerPC to use irq_domain. SPARC has its own method of managing interrupts from the device tree and is unaffected by this change. Acked-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>	2011-07-28 01:32:04 -06:00
Linus Torvalds	95b6886526	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (54 commits) tpm_nsc: Fix bug when loading multiple TPM drivers tpm: Move tpm_tis_reenable_interrupts out of CONFIG_PNP block tpm: Fix compilation warning when CONFIG_PNP is not defined TOMOYO: Update kernel-doc. tpm: Fix a typo tpm_tis: Probing function for Intel iTPM bug tpm_tis: Fix the probing for interrupts tpm_tis: Delay ACPI S3 suspend while the TPM is busy tpm_tis: Re-enable interrupts upon (S3) resume tpm: Fix display of data in pubek sysfs entry tpm_tis: Add timeouts sysfs entry tpm: Adjust interface timeouts if they are too small tpm: Use interface timeouts returned from the TPM tpm_tis: Introduce durations sysfs entry tpm: Adjust the durations if they are too small tpm: Use durations returned from TPM TOMOYO: Enable conditional ACL. TOMOYO: Allow using argv[]/envp[] of execve() as conditions. TOMOYO: Allow using executable's realpath and symlink's target as conditions. TOMOYO: Allow using owner/group etc. of file objects as conditions. ... Fix up trivial conflict in security/tomoyo/realpath.c	2011-07-27 19:26:38 -07:00
Hans Verkuil	2330fb8242	[media] v4l2-compat-ioctl32: add VIDIOC_DQEVENT support Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>	2011-07-27 17:53:20 -03:00
Oleg Nesterov	c1095c6da5	signals: sys_ssetmask/sys_rt_sigsuspend should use set_current_blocked() sys_ssetmask(), sys_rt_sigsuspend() and compat_sys_rt_sigsuspend() change ->blocked directly. This is not correct, see the changelog in `e6fa16ab` "signal: sigprocmask() should do retarget_shared_pending()" Change them to use set_current_blocked(). Another change is that now we are doing ->saved_sigmask = ->blocked lockless, it doesn't make any sense to do this under ->siglock. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-27 12:53:36 -07:00
Arun Sharma	60063497a9	atomic: use <linux/atomic.h> This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:47 -07:00
Hugh Dickins	4302fbc8ec	panic: panic=-1 for immediate reboot When a kernel BUG or oops occurs, ChromeOS intends to panic and immediately reboot, with stacktrace and other messages preserved in RAM across reboot. But the longer we delay, the more likely the user is to poweroff and lose the info. panic_timeout (seconds before rebooting) is set by panic= boot option or sysctl or /proc/sys/kernel/panic; but 0 means wait forever, so at present we have to delay at least 1 second. Let a negative number mean reboot immediately (with the small cosmetic benefit of suppressing that newline-less "Rebooting in %d seconds.." message). Signed-off-by: Hugh Dickins <hughd@chromium.org> Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Olaf Hering <olaf@aepfle.de> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:45 -07:00
Vitaliy Ivanov	947be5dfda	gcov: disable CONSTRUCTORS for UML Selecting GCOV for UML causing configuration mismatch: warning: (GCOV_KERNEL) selects CONSTRUCTORS which has unmet direct dependencies (!UML) Constructors are not needed for UML. Signed-off-by: Vitaliy Ivanov <vitalivanov@gmail.com> Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Acked-by: Richard Weinberger <richard@nod.at> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:45 -07:00
Vasiliy Kulikov	b34a6b1da3	ipc: introduce shm_rmid_forced sysctl Add support for the shm_rmid_forced sysctl. If set to 1, all shared memory objects in current ipc namespace will be automatically forced to use IPC_RMID. The POSIX way of handling shmem allows one to create shm objects and call shmdt(), leaving shm object associated with no process, thus consuming memory not counted via rlimits. With shm_rmid_forced=1 the shared memory object is counted at least for one process, so OOM killer may effectively kill the fat process holding the shared memory. It obviously breaks POSIX - some programs relying on the feature would stop working. So set shm_rmid_forced=1 only if you're sure nobody uses "orphaned" memory. Use shm_rmid_forced=0 by default for compatibility reasons. The feature was previously implemented in -ow as a configure option. [akpm@linux-foundation.org: fix documentation, per Randy] [akpm@linux-foundation.org: fix warning] [akpm@linux-foundation.org: readability/conventionality tweaks] [akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout] Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Solar Designer <solar@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:44 -07:00
Daniel Rebelo de Oliveira	fb0a685cb9	kernel/fork.c: fix a few coding style issues Signed-off-by: Daniel Rebelo de Oliveira <psykon@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:44 -07:00
Michal Hocko	778d3b0ff0	cpusets: randomize node rotor used in cpuset_mem_spread_node() [ This patch has already been accepted as commit `0ac0c0d0f8` but later reverted (commit `35926ff5fb`) because it introduced arch specific __node_random which was defined only for x86 code so it broke other archs. This is a followup without any arch specific code. Other than that there are no functional changes.] Some workloads that create a large number of small files tend to assign too many pages to node 0 (multi-node systems). Part of the reason is that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at node 0 for newly created tasks. This patch changes the rotor to be initialized to a random node number of the cpuset. [akpm@linux-foundation.org: fix layout] [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration] [mhocko@suse.cz: Make it arch independent] [akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build] Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Paul Menage <menage@google.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Jack Steiner <steiner@sgi.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Paul Menage <menage@google.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:43 -07:00
Shawn Bohrer	9ea71503a8	futex: Fix regression with read only mappings commit `7485d0d375` (futexes: Remove rw parameter from get_futex_key()) in 2.6.33 fixed two problems: First, it prevented a loop when encountering a ZERO_PAGE. Second, it fixed RW MAP_PRIVATE futex operations by forcing the COW to occur by unconditionally performing a write access get_user_pages_fast() to get the page. The commit also introduced a user-mode regression in that it broke futex operations on read-only memory maps. For example, this breaks workloads that have one or more reader processes doing a FUTEX_WAIT on a futex within a read only shared file mapping, and a writer process that has a writable mapping issuing the FUTEX_WAKE. This fixes the regression for valid futex operations on RO mappings by trying a RO get_user_pages_fast() when the RW get_user_pages_fast() fails. This change makes it necessary to also check for invalid use cases, such as anonymous RO mappings (which can never change) and the ZERO_PAGE which the commit referenced above was written to address. This patch does restore the original behavior with RO MAP_PRIVATE mappings, which have inherent user-mode usage problems and don't really make sense. With this patch performing a FUTEX_WAIT within a RO MAP_PRIVATE mapping will be successfully woken provided another process updates the region of the underlying mapped file. However, the mmap() man page states that for a MAP_PRIVATE mapping: It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region. So user-mode users attempting to use futex operations on RO MAP_PRIVATE mappings are depending on unspecified behavior. Additionally a RO MAP_PRIVATE mapping could fail to wake up in the following case. Thread-A: call futex(FUTEX_WAIT, memory-region-A). get_futex_key() return inode based key. sleep on the key Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A) Thread-B: write memory-region-A. COW happen. This process's memory-region-A become related to new COWed private (ie PageAnon=1) page. Thread-B: call futex(FUTEX_WAKE, memory-region-A). get_futex_key() return mm based key. IOW, we fail to wake up Thread-A. Once again doing something like this is just silly and users who do something like this get what they deserve. While RO MAP_PRIVATE mappings are nonsensical, checking for a private mapping requires walking the vmas and was deemed too costly to avoid a userspace hang. This Patch is based on Peter Zijlstra's initial patch with modifications to only allow RO mappings for futex operations that need VERIFY_READ access. Reported-by: David Oliver <david@rgmadvisors.com> Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Darren Hart <dvhart@linux.intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: peterz@infradead.org Cc: eric.dumazet@gmail.com Cc: zvonler@rgmadvisors.com Cc: hughd@google.com Link: http://lkml.kernel.org/r/1309450892-30676-1-git-send-email-sbohrer@rgmadvisors.com Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-07-26 20:59:35 +02:00
jhbird.choi@samsung.com	1dd75f91ae	genirq: Fix wrong bit operation (!msk & 0x01) should be !(msk & 0x01) Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com> Link: http://lkml.kernel.org/r/1311229754-6003-1-git-send-email-jhbird.choi@samsung.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org	2011-07-26 16:24:02 +02:00
Linus Torvalds	45b583b10a	Merge 'akpm' patch series * Merge akpm patch series: (122 commits) drivers/connector/cn_proc.c: remove unused local Documentation/SubmitChecklist: add RCU debug config options reiserfs: use hweight_long() reiserfs: use proper little-endian bitops pnpacpi: register disabled resources drivers/rtc/rtc-tegra.c: properly initialize spinlock drivers/rtc/rtc-twl.c: check return value of twl_rtc_write_u8() in twl_rtc_set_time() drivers/rtc: add support for Qualcomm PMIC8xxx RTC drivers/rtc/rtc-s3c.c: support clock gating drivers/rtc/rtc-mpc5121.c: add support for RTC on MPC5200 init: skip calibration delay if previously done misc/eeprom: add eeprom access driver for digsy_mtc board misc/eeprom: add driver for microwire 93xx46 EEPROMs checkpatch.pl: update $logFunctions checkpatch: make utf-8 test --strict checkpatch.pl: add ability to ignore various messages checkpatch: add a "prefer __aligned" check checkpatch: validate signature styles and To: and Cc: lines checkpatch: add __rcu as a sparse modifier checkpatch: suggest using min_t or max_t ... Did this as a merge because of (trivial) conflicts in - Documentation/feature-removal-schedule.txt - arch/xtensa/include/asm/uaccess.h that were just easier to fix up in the merge than in the patch series.	2011-07-25 21:00:19 -07:00
Stephen Boyd	626a031251	kernel/configs.c: include MODULE_() when CONFIG_IKCONFIG_PROC=n If CONFIG_IKCONFIG=m but CONFIG_IKCONFIG_PROC=n we get a module that has no MODULE_LICENSE definition. Move the MODULE_() definitions outside the CONFIG_IKCONFIG_PROC #ifdef to prevent this configuration from tainting the kernel. Signed-off-by: Stephen Boyd <bebarino@gmail.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 20:57:15 -07:00
Amerigo Wang	c5f41752fd	notifiers: sys: move reboot notifiers into reboot.h It is not necessary to share the same notifier.h. This patch also moves register_reboot_notifier() and unregister_reboot_notifier() from kernel/notifier.c to kernel/sys.c. [amwang@redhat.com: make allyesconfig succeed on ppc64] Signed-off-by: WANG Cong <amwang@redhat.com> Cc: David Miller <davem@davemloft.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 20:57:14 -07:00
Maxin B John	ae891a1b93	devres: fix possible use after free devres uses the pointer value as key after it's freed, which is safe but triggers spurious use-after-free warnings on some static analysis tools. Rearrange code to avoid such warnings. Signed-off-by: Maxin B. John <maxin.john@gmail.com> Reviewed-by: Rolf Eike Beer <eike-kernel@sf-tec.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 20:57:14 -07:00
Benjamin Herrenschmidt	2efaca927f	mm/futex: fix futex writes on archs with SW tracking of dirty & young I haven't reproduced it myself but the fail scenario is that on such machines (notably ARM and some embedded powerpc), if you manage to hit that futex path on a writable page whose dirty bit has gone from the PTE, you'll livelock inside the kernel from what I can tell. It will go in a loop of trying the atomic access, failing, trying gup to "fix it up", getting success from gup, go back to the atomic access, failing again because dirty wasn't fixed etc... So I think you essentially hang in the kernel. The scenario is probably rare'ish because affected architectures are embedded and tend to not swap much (if at all) so we probably rarely hit the case where dirty is missing or young is missing, but I think Shan has a piece of SW that can reliably reproduce it using a shared writable mapping & fork or something like that. On archs who use SW tracking of dirty & young, a page without dirty is effectively mapped read-only and a page without young unaccessible in the PTE. Additionally, some architectures might lazily flush the TLB when relaxing write protection (by doing only a local flush), and expect a fault to invalidate the stale entry if it's still present on another processor. The futex code assumes that if the "in_atomic()" access -EFAULT's, it can "fix it up" by causing get_user_pages() which would then be equivalent to taking the fault. However that isn't the case. get_user_pages() will not call handle_mm_fault() in the case where the PTE seems to have the right permissions, regardless of the dirty and young state. It will eventually update those bits ... in the struct page, but not in the PTE. Additionally, it will not handle the lazy TLB flushing that can be required by some architectures in the fault case. Basically, gup is the wrong interface for the job. The patch provides a more appropriate one which boils down to just calling handle_mm_fault() since what we are trying to do is simulate a real page fault. The futex code currently attempts to write to user memory within a pagefault disabled section, and if that fails, tries to fix it up using get_user_pages(). This doesn't work on archs where the dirty and young bits are maintained by software, since they will gate access permission in the TLB, and will not be updated by gup(). In addition, there's an expectation on some archs that a spurious write fault triggers a local TLB flush, and that is missing from the picture as well. I decided that adding those "features" to gup() would be too much for this already too complex function, and instead added a new simpler fixup_user_fault() which is essentially a wrapper around handle_mm_fault() which the futex code can call. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reported-by: Shan Hai <haishan.bai@gmail.com> Tested-by: Shan Hai <haishan.bai@gmail.com> Cc: David Laight <David.Laight@ACULAB.COM> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <darren.hart@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 20:57:11 -07:00
Linus Torvalds	154dd78d30	Merge branches 'kbuild', 'packaging' and 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6 * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6: genksyms: Use same type in loop comparison kbuild: silence generated makefile message kernel: prevent unnecessary rebuilding due to config_data.gz headers_install: fix __packed in exported kernel headers dtc: regen parser dtc: migrate parser to implicit rules kconfig: regen parser kconfig: migrate parser to implicit rules kconfig/zconf.l: do not ask to generate backup kconfig: kill no longer needed reference to YYDEBUG kconfig: constify `kconf_id_lookup' genksym: regen parser genksyms: migrate parser to implicit rules genksyms: drop -Wno-uninitialized from HOSTCFLAGS_parse.tab.o genksyms: pass hash and lookup functions name and target language though the input file kbuild: simplify the %_shipped rule kbuild: add implicit rules for parser generation kbuild: add `baseprereq' kbuild: Fix reference to vermagic.h * 'packaging' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6: package: Makefile: fix perf target bug * 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6: gitignore: ignore debian build directory	2011-07-25 20:01:57 -07:00
Linus Torvalds	d3ec4844d4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) fs: Merge split strings treewide: fix potentially dangerous trailing ';' in #defined values/expressions uwb: Fix misspelling of neighbourhood in comment net, netfilter: Remove redundant goto in ebt_ulog_packet trivial: don't touch files that are removed in the staging tree lib/vsprintf: replace link to Draft by final RFC number doc: Kconfig: `to be' -> `be' doc: Kconfig: Typo: square -> squared doc: Konfig: Documentation/power/{pm => apm-acpi}.txt drivers/net: static should be at beginning of declaration drivers/media: static should be at beginning of declaration drivers/i2c: static should be at beginning of declaration XTENSA: static should be at beginning of declaration SH: static should be at beginning of declaration MIPS: static should be at beginning of declaration ARM: static should be at beginning of declaration rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check Update my e-mail address PCIe ASPM: forcedly -> forcibly gma500: push through device driver tree ... Fix up trivial conflicts: - arch/arm/mach-ep93xx/dma-m2p.c (deleted) - drivers/gpio/gpio-ep93xx.c (renamed and context nearby) - drivers/net/r8169.c (just context changes)	2011-07-25 13:56:39 -07:00
Linus Torvalds	096a705bbc	Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-block * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits) block: strict rq_affinity backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu block: fix patch import error in max_discard_sectors check block: reorder request_queue to remove 64 bit alignment padding CFQ: add think time check for group CFQ: add think time check for service tree CFQ: move think time check variables to a separate struct fixlet: Remove fs_excl from struct task. cfq: Remove special treatment for metadata rqs. block: document blk_plug list access block: avoid building too big plug list compat_ioctl: fix make headers_check regression block: eliminate potential for infinite loop in blkdev_issue_discard compat_ioctl: fix warning caused by qemu block: flush MEDIA_CHANGE from drivers on close(2) blk-throttle: Make total_nr_queued unsigned block: Add __attribute__((format(printf...) and fix fallout fs/partitions/check.c: make local symbols static block:remove some spare spaces in genhd.c block:fix the comment error in blkdev.h ...	2011-07-25 10:33:36 -07:00
Jesper Juhl	f629299b54	trace events: Update version number reference to new 3.x scheme for EVENT_POWER_TRACING_DEPRECATED What was scheduled to be 2.6.41 is now going to be 3.1. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1107250929370.8080@swampdragon.chaosbits.net Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-25 09:37:21 +02:00
Linus Torvalds	fcda12e7f6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: modpost: Fix modpost's license checking V3 module: add /sys/module/<name>/uevent files module: change attr callbacks to take struct module_kobject modules: make arch's use default loader hooks modules: add default loader hook implementations param: fix return value handling in param_set_*	2011-07-24 09:54:54 -07:00
Linus Torvalds	5fabc487c9	Merge branch 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits) KVM: IOMMU: Disable device assignment without interrupt remapping KVM: MMU: trace mmio page fault KVM: MMU: mmio page fault support KVM: MMU: reorganize struct kvm_shadow_walk_iterator KVM: MMU: lockless walking shadow page table KVM: MMU: do not need atomicly to set/clear spte KVM: MMU: introduce the rules to modify shadow page table KVM: MMU: abstract some functions to handle fault pfn KVM: MMU: filter out the mmio pfn from the fault pfn KVM: MMU: remove bypass_guest_pf KVM: MMU: split kvm_mmu_free_page KVM: MMU: count used shadow pages on prepareing path KVM: MMU: rename 'pt_write' to 'emulate' KVM: MMU: cleanup for FNAME(fetch) KVM: MMU: optimize to handle dirty bit KVM: MMU: cache mmio info on page fault path KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code KVM: MMU: do not update slot bitmap if spte is nonpresent KVM: MMU: fix walking shadow page table KVM guest: KVM Steal time registration ...	2011-07-24 09:07:03 -07:00
Kay Sievers	88bfa32479	module: add /sys/module/<name>/uevent files Userspace wants to manage module parameters with udev rules. This currently only works for loaded modules, but not for built-in ones. To allow access to the built-in modules we need to re-trigger all module load events that happened before any userspace was running. We already do the same thing for all devices, subsystems(buses) and drivers. This adds the currently missing /sys/module/<name>/uevent files to all module entries. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (split & trivial fix)	2011-07-24 22:06:04 +09:30
Kay Sievers	4befb026cf	module: change attr callbacks to take struct module_kobject This simplifies the next patch, where we have an attribute on a builtin module (ie. module == NULL). Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (split into 2)	2011-07-24 22:06:04 +09:30
Jonas Bonn	74e08fcf7b	modules: add default loader hook implementations The module loader code allows architectures to hook into the code by providing a small number of entry points that each arch must implement. This patch provides __weakly linked generic implementations of these entry points for architectures that don't need to do anything special. Signed-off-by: Jonas Bonn <jonas@southpole.se> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-07-24 22:06:04 +09:30
Satoru Moriya	81c7413650	param: fix return value handling in param_set_* In STANDARD_PARAM_DEF, param_set_* handles the case in which strtolfn returns -EINVAL, but strtolfn may also return -ERANGE. If it returns -ERANGE, param_set_* may set an uninitialized value to the parameter. We should handle both cases. One case in which strtolfn() returns -ERANGE is the following: the module parameter's type is long, and the parameter is set to a value greater than LONG_MAX. Signed-off-by: Satoru Moriya <satoru.moriya@hds.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-07-24 22:06:03 +09:30
Linus Torvalds	bbd9d6f7fb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits) vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp isofs: Remove global fs lock jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory fix IN_DELETE_SELF on overwriting rename() on ramfs et.al. mm/truncate.c: fix build for CONFIG_BLOCK not enabled fs:update the NOTE of the file_operations structure Remove dead code in dget_parent() AFS: Fix silly characters in a comment switch d_add_ci() to d_splice_alias() in "found negative" case as well simplify gfs2_lookup() jfs_lookup(): don't bother with . or .. get rid of useless dget_parent() in btrfs rename() and link() get rid of useless dget_parent() in fs/btrfs/ioctl.c fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers drivers: fix up various ->llseek() implementations fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek Ext4: handle SEEK_HOLE/SEEK_DATA generically Btrfs: implement our own ->llseek fs: add SEEK_HOLE and SEEK_DATA flags reiserfs: make reiserfs default to barrier=flush ... Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new shrinker callout for the inode cache, that clashed with the xfs code to start the periodic workers later.	2011-07-22 19:02:39 -07:00
Linus Torvalds	dc43d9fa73	Merge branch 'x86-mtrr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-mtrr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, mtrr: Use pci_dev->revision x86, mtrr: use stop_machine APIs for doing MTRR rendezvous stop_machine: implement stop_machine_from_inactive_cpu() stop_machine: reorganize stop_cpus() implementation x86, mtrr: lock stop machine during MTRR rendezvous sequence	2011-07-22 17:04:04 -07:00
Linus Torvalds	112ec46966	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: time: Fix stupid KERN_WARN compile issue rtc: Avoid accumulating time drift in suspend/resume time: Avoid accumulating time drift in suspend/resume time: Catch invalid timespec sleep values in __timekeeping_inject_sleeptime	2011-07-22 16:52:18 -07:00
Linus Torvalds	bdc7ccfc06	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits) sched: Cleanup duplicate local variable in [enqueue\|dequeue]_task_fair sched: Replace use of entity_key() sched: Separate group-scheduling code more clearly sched: Reorder root_domain to remove 64 bit alignment padding sched: Do not attempt to destroy uninitialized rt_bandwidth sched: Remove unused function cpu_cfs_rq() sched: Fix (harmless) typo 'CONFG_FAIR_GROUP_SCHED' sched, cgroup: Optimize load_balance_fair() sched: Don't update shares twice on on_rq parent sched: update correct entity's runtime in check_preempt_wakeup() xtensa: Use generic config PREEMPT definition h8300: Use generic config PREEMPT definition m32r: Use generic PREEMPT config sched: Skip autogroup when looking for all rt sched groups sched: Simplify mutex_spin_on_owner() sched: Remove rcu_read_lock() from wake_affine() sched: Generalize sleep inside spinlock detection sched: Make sleeping inside spinlock detection working in !CONFIG_PREEMPT sched: Isolate preempt counting in its own config option sched: Remove pointless in_atomic() definition check ...	2011-07-22 16:45:02 -07:00
Linus Torvalds	4d4abdcb1d	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (123 commits) perf: Remove the nmi parameter from the oprofile_perf backend x86, perf: Make copy_from_user_nmi() a library function perf: Remove perf_event_attr::type check x86, perf: P4 PMU - Fix typos in comments and style cleanup perf tools: Make test use the preset debugfs path perf tools: Add automated tests for events parsing perf tools: De-opt the parse_events function perf script: Fix display of IP address for non-callchain path perf tools: Fix endian conversion reading event attr from file header perf tools: Add missing 'node' alias to the hw_cache[] array perf probe: Support adding probes on offline kernel modules perf probe: Add probed module in front of function perf probe: Introduce debuginfo to encapsulate dwarf information perf-probe: Move dwarf library routines to dwarf-aux.{c, h} perf probe: Remove redundant dwarf functions perf probe: Move strtailcmp to string.c perf probe: Rename DIE_FIND_CB_FOUND to DIE_FIND_CB_END tracing/kprobe: Update symbol reference when loading module tracing/kprobes: Support module init function probing kprobes: Return -ENOENT if probe point doesn't exist ...	2011-07-22 16:44:39 -07:00
Linus Torvalds	0342cbcfce	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: Fix wrong check in list_splice_init_rcu() net,rcu: Convert call_rcu(xt_rateest_free_rcu) to kfree_rcu() sysctl,rcu: Convert call_rcu(free_head) to kfree vmalloc,rcu: Convert call_rcu(rcu_free_vb) to kfree_rcu() vmalloc,rcu: Convert call_rcu(rcu_free_va) to kfree_rcu() ipc,rcu: Convert call_rcu(ipc_immediate_free) to kfree_rcu() ipc,rcu: Convert call_rcu(free_un) to kfree_rcu() security,rcu: Convert call_rcu(sel_netport_free) to kfree_rcu() security,rcu: Convert call_rcu(sel_netnode_free) to kfree_rcu() ia64,rcu: Convert call_rcu(sn_irq_info_free) to kfree_rcu() block,rcu: Convert call_rcu(disk_free_ptbl_rcu_cb) to kfree_rcu() scsi,rcu: Convert call_rcu(fc_rport_free_rcu) to kfree_rcu() audit_tree,rcu: Convert call_rcu(__put_tree) to kfree_rcu() security,rcu: Convert call_rcu(whitelist_item_free) to kfree_rcu() md,rcu: Convert call_rcu(free_conf) to kfree_rcu()	2011-07-22 16:44:08 -07:00
Linus Torvalds	391d6276db	Merge branch 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: Fix trace_[soft,hard]irqs_[on,off]() recursion printk: Fix console_sem vs logbuf_lock unlock race printk: Release console_sem after logbuf_lock	2011-07-22 16:43:49 -07:00
Linus Torvalds	75b56ec294	Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: Fix lockdep_no_validate against IRQ states mutex: Make mutex_destroy() an inline function plist: Remove the need to supply locks to plist heads lockup detector: Fix reference to the non-existent CONFIG_DETECT_SOFTLOCKUP option	2011-07-22 16:43:21 -07:00
Linus Torvalds	431bf99d26	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: (51 commits) PM: Improve error code of pm_notifier_call_chain() PM: Add "RTC" to PM trace time stamps to avoid confusion PM / Suspend: Export suspend_set_ops, suspend_valid_only_mem PM / Suspend: Add .suspend_again() callback to suspend_ops PM / OPP: Introduce function to free cpufreq table ARM / shmobile: Return -EBUSY from A4LC power off if A3RV is active PM / Domains: Take .power_off() error code into account ARM / shmobile: Use genpd_queue_power_off_work() ARM / shmobile: Use pm_genpd_poweroff_unused() PM / Domains: Introduce function to power off all unused PM domains OMAP: PM: disable idle on suspend for GPIO and UART OMAP: PM: omap_device: add API to disable idle on suspend OMAP: PM: omap_device: add system PM methods for PM domain handling OMAP: PM: omap_device: conditionally use PM domain runtime helpers PM / Runtime: Add new helper function: pm_runtime_status_suspended() PM / Domains: Queue up power off work only if it is not pending PM / Domains: Improve handling of wakeup devices during system suspend PM / Domains: Do not restore all devices on power off error PM / Domains: Allow callbacks to execute all runtime PM helpers PM / Domains: Do not execute device callbacks under locks ...	2011-07-22 16:01:57 -07:00
Linus Torvalds	5a791ea4fa	Merge branch 'for-3.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-3.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: separate out drain_workqueue() from destroy_workqueue() workqueue: remove cancel_rearming_delayed_work[queue]()	2011-07-22 15:07:15 -07:00
Linus Torvalds	8209f53d79	Merge branch 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits) ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction connector: add an event for monitoring process tracers ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task() ptrace_init_task: initialize child->jobctl explicitly has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/ ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/ ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented() ptrace: ptrace_reparented() should check same_thread_group() redefine thread_group_leader() as exit_signal >= 0 do not change dead_task->exit_signal kill task_detached() reparent_leader: check EXIT_DEAD instead of task_detached() make do_notify_parent() __must_check, update the callers __ptrace_detach: avoid task_detached(), check do_notify_parent() kill tracehook_notify_death() make do_notify_parent() return bool ptrace: s/tracehook_tracer_task()/ptrace_parent()/ ...	2011-07-22 15:06:50 -07:00
Linus Torvalds	951cc93a74	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1287 commits) icmp: Fix regression in nexthop resolution during replies. net: Fix ppc64 BPF JIT dependencies. acenic: include NET_SKB_PAD headroom to incoming skbs ixgbe: convert to ndo_fix_features ixgbe: only enable WoL for magic packet by default ixgbe: remove ifdef check for non-existent define ixgbe: Pass staterr instead of re-reading status and error bits from descriptor ixgbe: Move interrupt related values out of ring and into q_vector ixgbe: add structure for containing RX/TX rings to q_vector ixgbe: inline the ixgbe_maybe_stop_tx function ixgbe: Update ATR to use recorded TX queues instead of CPU for routing igb: Fix for DH89xxCC near end loopback test e1000: always call e1000_check_for_link() on e1000_ce4100 MACs. netxen: add fw version compatibility check be2net: request native mode each time the card is reset ipv4: Constrain UFO fragment sizes to multiples of 8 bytes virtio_net: Fix panic in virtnet_remove ipv6: make fragment identifications less predictable ipv6: unshare inetpeers can: make function can_get_bittiming static ...	2011-07-22 14:43:13 -07:00
Lin Ming	0f3171438f	sched: Cleanup duplicate local variable in [enqueue\|dequeue]_task_fair No need to define a new "cfs_rq" variable in the "for" block. Just use the one at the top of the function. Signed-off-by: Lin Ming <ming.m.lin@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311297271.3938.1352.camel@minggr.sh.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-22 12:47:22 +02:00
David S. Miller	033b1142f4	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/bluetooth/l2cap_core.c	2011-07-21 13:38:42 -07:00
David S. Miller	f5caadbb3d	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6	2011-07-21 12:39:35 -07:00
Peter Zijlstra	efbe2eee6d	lockdep: Fix lockdep_no_validate against IRQ states Thomas noticed that a lock marked with lockdep_set_novalidate_class() will still trigger warnings for IRQ inversions. Cure this by skipping those when marking irq state. Reported-and-tested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-2dp5vmpsxeraqm42kgww6ge2@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 20:43:16 +02:00
Lin Ming	9985c20f9e	perf: Remove perf_event_attr::type check PMU type id can be allocated dynamically, so perf_event_attr::type check when copying attribute from userspace to kernel is not valid. Signed-off-by: Lin Ming <ming.m.lin@intel.com> Cc: Robert Richter <robert.richter@amd.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1309421396-17438-4-git-send-email-ming.m.lin@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 20:41:55 +02:00
Stephan Baerwolf	2bd2d6f2dc	sched: Replace use of entity_key() "entity_key()" is only used in "__enqueue_entity()" and its only function is to subtract the group's min_vruntime from a task's vruntime. Before this patch, an rbtree enqueue decision is made by comparing two tasks in the style: "if (entity_key(cfs_rq, se) < entity_key(cfs_rq, entry))" which would be "if (se->vruntime-cfs_rq->min_vruntime < entry->vruntime-cfs_rq->min_vruntime)" or (if cancelling cfs_rq->min_vruntime on both sides) "if (se->vruntime < entry->vruntime)" which is "if (entity_before(se, entry))". So we do not need "entity_key()". If "entity_before()" is inline we will also save one subtraction (only one, because "entity_key(cfs_rq, se)" was cached in "key"). Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-ns12mnd2h5w8rb9agd8hnsfk@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 18:01:55 +02:00
Jan H. Schönherr	acb5a9ba3b	sched: Separate group-scheduling code more clearly Clean up cfs/rt runqueue initialization by moving group scheduling related code into the corresponding functions. Also, keep group scheduling as an add-on, so that things are only done additionally, i.e. remove the init_*_rq() calls from init_tg_*_entry(). (This removes a redundant initialization during sched_init()). In case of group scheduling rt_rq->highest_prio.curr is now initialized twice, but adding another #ifdef seems not worth it. Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1310661163-16606-1-git-send-email-schnhrr@cs.tu-berlin.de Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 18:01:54 +02:00
Richard Kennedy	26a148eb9c	sched: Reorder root_domain to remove 64 bit alignment padding Reorder root_domain to remove 8 bytes of alignment padding on 64 bit builds, this shrinks the size from 1736 to 1728 bytes, therefore using one fewer cachelines. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1310726492.1977.5.camel@castor.rsk Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 18:01:52 +02:00
Bianca Lutz	99bc52429f	sched: Do not attempt to destroy uninitialized rt_bandwidth If a task group is to be created and alloc_fair_sched_group() fails, then the rt_bandwidth of the corresponding task group is not yet initialized. The caller, sched_create_group(), starts a clean up procedure which calls free_rt_sched_group() which unconditionally destroys the not yet initialized rt_bandwidth. This crashes or hangs the system in lock_hrtimer_base(): UP systems dereference a NULL pointer, while SMP systems loop endlessly on a condition that cannot become true. This patch simply avoids the destruction of rt_bandwidth when the initialization code path was not reached. (This was discovered by accident with a custom kernel modification.) Signed-off-by: Bianca Lutz <sowilo@cs.tu-berlin.de> Signed-off-by: Jan Schoenherr <schnhrr@cs.tu-berlin.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1310580816-10861-7-git-send-email-schnhrr@cs.tu-berlin.de Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 18:01:51 +02:00
Jan Schoenherr	045176d22f	sched: Remove unused function cpu_cfs_rq() The last reference to cpu_cfs_rq() was removed with commit `88ec22d3` ("sched: Remove the cfs_rq dependency from set_task_cpu()"). Thus, remove this function, too. Signed-off-by: Jan Schoenherr <schnhrr@cs.tu-berlin.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1310580816-10861-3-git-send-email-schnhrr@cs.tu-berlin.de Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 18:01:49 +02:00
Jan Schoenherr	5f817d676b	sched: Fix (harmless) typo 'CONFG_FAIR_GROUP_SCHED' This patch fixes a typo located in a comment. Signed-off-by: Jan Schoenherr <schnhrr@cs.tu-berlin.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1310580816-10861-2-git-send-email-schnhrr@cs.tu-berlin.de Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 18:01:48 +02:00
Peter Zijlstra	9763b67fb9	sched, cgroup: Optimize load_balance_fair() Use for_each_leaf_cfs_rq() instead of list_for_each_entry_rcu(), so that load_balance_fair() only iterates over those task_groups that actually have tasks on busiest, and so that we iterate bottom-up, trying to move light groups before the heavier ones. No idea if it will actually work out to be beneficial in practice; does anybody have a cgroup workload that might show a difference one way or the other? [ Also move update_h_load to sched_fair.c, losing #ifdef-ery ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/1310557009.2586.28.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 18:01:46 +02:00
Paul Turner	9598c82dca	sched: Don't update shares twice on on_rq parent In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity with additional weight. However, we perform a double shares update on this entity as we continue the shares update traversal from this point, despite dequeue_entity() having already updated its queuing cfs_rq. Avoid this by starting from the parent when we resume. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110707053059.797714697@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 18:01:44 +02:00
Paul Turner	9bbd737436	sched: update correct entity's runtime in check_preempt_wakeup() While looking at check_preempt_wakeup() I realized that we are potentially updating the wrong entity in the fair-group scheduling case. In this case the current task's cfs_rq may not be the same as the one used for the comparison between the waking task and the existing task's vruntime. This potentially results in us using a stale vruntime in the pre-emption decision, providing a small false preference for the previous task. The effects of this are bounded since we always perform a hierarchical update on the tick. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/CAPM31R+2Ke2urUZKao5W92_LupdR4AYEv-EZWiJ3tG=tEes2cw@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 18:01:43 +02:00
Ingo Molnar	994bf1c922	Merge branch 'linus' into sched/core Merge reason: pick up the latest scheduler fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 18:00:01 +02:00
Oleg Nesterov	8a35241803	ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction Simple test-case, int main(void) { int pid, status; pid = fork(); if (!pid) { pause(); assert(0); return 0x23; } assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0); assert(wait(&status) == pid); assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP); kill(pid, SIGCONT); // <--- also clears STOP_DEQUEUED assert(ptrace(PTRACE_CONT, pid, 0,0) == 0); assert(wait(&status) == pid); assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGCONT); assert(ptrace(PTRACE_CONT, pid, 0, SIGSTOP) == 0); assert(wait(&status) == pid); assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP); kill(pid, SIGKILL); return 0; } Without the patch it hangs. After the patch SIGSTOP "injected" by the tracer is not ignored and stops the tracee. Note also that if this test-case uses, say, SIGWINCH instead of SIGCONT, everything works without the patch. This can't be right, and this is confusing. The problem is that SIGSTOP (or any other sig_kernel_stop() signal) has no effect without JOBCTL_STOP_DEQUEUED. This means it is simply ignored after PTRACE_CONT unless JOBCTL_STOP_DEQUEUED was set "by accident", say it wasn't cleared after initial SIGSTOP sent by PTRACE_ATTACH. At first glance we could change ptrace_signal() to add STOP_DEQUEUED after return from ptrace_stop(), but this is not right in case when the tracer does not change the reported SIGSTOP and SIGCONT comes in between. This is even more wrong with PT_SEIZED, SIGCONT adds JOBCTL_TRAP_NOTIFY which will be "lost" during the TRAP_STOP \| TRAP_NOTIFY report. So let's add STOP_DEQUEUED _before_ we report the signal. It has no effect unless sig_kernel_stop() == T after the tracer resumes us, and in the latter case the pending STOP_DEQUEUED means no SIGCONT in between, we should stop. Note also that if SIGCONT was sent, PT_SEIZED tracee will correctly report PTRACE_EVENT_STOP/SIGTRAP and thus the tracer can notice the fact SIGSTOP was cancelled. 
Also, move the current->ptrace check from ptrace_signal() to its caller, get_signal_to_deliver(), this looks more natural. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-07-21 17:06:53 +02:00
Ingo Molnar	40bcea7bbe	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2011-07-21 09:32:40 +02:00
Ingo Molnar	492f73a303	Merge branch 'perf/urgent' into perf/core Merge reason: pick up the latest fixes - they won't make v3.0. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 09:29:21 +02:00
Christoph Hellwig	11b80f459a	rw_semaphore: remove up/down_read_non_owner Now that the last user is gone these can be removed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:47 -04:00
Linus Torvalds	cf6ace16a3	Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: signal: align __lock_task_sighand() irq disabling and RCU softirq,rcu: Inform RCU of irq_exit() activity sched: Add irq_{enter,exit}() to scheduler_ipi() rcu: protect __rcu_read_unlock() against scheduler-using irq handlers rcu: Streamline code produced by __rcu_read_unlock() rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_special rcu: decrease rcu_report_exp_rnp coupling with scheduler	2011-07-20 15:56:25 -07:00
John Stultz	cbaa51524b	time: Fix stupid KERN_WARN compile issue Terribly embarrassing. Don't know how I committed this, but it's KERN_WARNING not KERN_WARN. This fixes the following compile error: kernel/time/timekeeping.c: In function ‘__timekeeping_inject_sleeptime’: kernel/time/timekeeping.c:608: error: ‘KERN_WARN’ undeclared (first use in this function) kernel/time/timekeeping.c:608: error: (Each undeclared identifier is reported only once kernel/time/timekeeping.c:608: error: for each function it appears in.) kernel/time/timekeeping.c:608: error: expected ‘)’ before string constant make[2]: *** [kernel/time/timekeeping.o] Error 1 Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-07-20 15:42:55 -07:00
Paul E. McKenney	a95cded32d	sysctl,rcu: Convert call_rcu(free_head) to kfree The RCU callback free_head just calls kfree(), so we can use kfree_rcu() instead of call_rcu(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-07-20 14:10:18 -07:00
Lai Jiangshan	3b097c4696	audit_tree,rcu: Convert call_rcu(__put_tree) to kfree_rcu() The rcu callback __put_tree() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(__put_tree). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-07-20 14:10:11 -07:00
Ingo Molnar	d1e9ae47a0	Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/urgent	2011-07-20 20:59:26 +02:00
Paul E. McKenney	a841796f11	signal: align __lock_task_sighand() irq disabling and RCU The __lock_task_sighand() function calls rcu_read_lock() with interrupts and preemption enabled, but later calls rcu_read_unlock() with interrupts disabled. It is therefore possible that this RCU read-side critical section will be preempted and later RCU priority boosted, which means that rcu_read_unlock() will call rt_mutex_unlock() in order to deboost itself, but with interrupts disabled. This results in lockdep splats, so this commit nests the RCU read-side critical section within the interrupt-disabled region of code. This prevents the RCU read-side critical section from being preempted, and thus prevents the attempt to deboost with interrupts disabled. It is quite possible that a better long-term fix is to make rt_mutex_unlock() disable irqs when acquiring the rt_mutex structure's ->wait_lock. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-07-20 11:04:54 -07:00
Peter Zijlstra	ec433f0c51	softirq,rcu: Inform RCU of irq_exit() activity The rcu_read_unlock_special() function relies on in_irq() to exclude scheduler activity from interrupt level. This fails because irq_exit() can invoke the scheduler after clearing the preempt_count() bits that in_irq() uses to determine that it is at interrupt level. This situation can result in failures as follows: $task IRQ SoftIRQ rcu_read_lock() /* do stuff */ <preempt> \|= UNLOCK_BLOCKED rcu_read_unlock() --t->rcu_read_lock_nesting irq_enter(); /* do stuff, don't use RCU */ irq_exit(); sub_preempt_count(IRQ_EXIT_OFFSET); invoke_softirq() ttwu(); spin_lock_irq(&pi->lock) rcu_read_lock(); /* do stuff */ rcu_read_unlock(); rcu_read_unlock_special() rcu_report_exp_rnp() ttwu() spin_lock_irq(&pi->lock) /* deadlock */ rcu_read_unlock_special(t); Ed can simply trigger this 'easy' because invoke_softirq() immediately does a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff first, but even without that the above happens. Cure this by also excluding softirqs from the rcu_read_unlock_special() handler and ensuring the force_irqthreads ksoftirqd/# wakeup is done from full softirq context. [ Alternatively, delaying the ->rcu_read_lock_nesting decrement until after the special handling would make the thing more robust in the face of interrupts as well. And there is a separate patch for that. ] Cc: Thomas Gleixner <tglx@linutronix.de> Reported-and-tested-by: Ed Tomlinson <edt@aei.ca> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-07-20 10:50:12 -07:00
Peter Zijlstra	c5d753a55a	sched: Add irq_{enter,exit}() to scheduler_ipi() Ensure scheduler_ipi() calls irq_{enter,exit} when it does some actual work. Traditionally we never did any actual work from the resched IPI and all magic happened in the return from interrupt path. Now that we do do some work, we need to ensure irq_{enter,exit} are called so that we don't confuse things. This affects things like timekeeping, NO_HZ and RCU, basically everything with a hook in irq_enter/exit. Explicit examples of things going wrong are: sched_clock_cpu() -- has a callback when leaving NO_HZ state to take a new reading from GTOD and TSC. Without this callback, time is stuck in the past. RCU -- needs in_irq() to work in order to avoid some nasty deadlocks Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-07-20 10:50:11 -07:00
Paul E. McKenney	10f39bb1b2	rcu: protect __rcu_read_unlock() against scheduler-using irq handlers The addition of RCU read-side critical sections within runqueue and priority-inheritance lock critical sections introduced some deadlock cycles, for example, involving interrupts from __rcu_read_unlock() where the interrupt handlers call wake_up(). This situation can cause the instance of __rcu_read_unlock() invoked from interrupt to do some of the processing that would otherwise have been carried out by the task-level instance of __rcu_read_unlock(). When the interrupt-level instance of __rcu_read_unlock() is called with a scheduler lock held from interrupt-entry/exit situations where in_irq() returns false, deadlock can result. This commit resolves these deadlocks by using negative values of the per-task ->rcu_read_lock_nesting counter to indicate that an instance of __rcu_read_unlock() is in flight, which in turn prevents instances from interrupt handlers from doing any special processing. This patch is inspired by Steven Rostedt's earlier patch that similarly made __rcu_read_unlock() guard against interrupt-mediated recursion (see https://lkml.org/lkml/2011/7/15/326), but this commit refines Steven's approach to avoid the need for preemption disabling on the __rcu_read_unlock() fastpath and to also avoid the need for manipulating a separate per-CPU variable. This patch avoids need for preempt_disable() by instead using negative values of the per-task ->rcu_read_lock_nesting counter. Note that nested rcu_read_lock()/rcu_read_unlock() pairs are still permitted, but they will never see ->rcu_read_lock_nesting go to zero, and will therefore never invoke rcu_read_unlock_special(), thus preventing them from seeing the RCU_READ_UNLOCK_BLOCKED bit should it be set in ->rcu_read_unlock_special. 
This patch also adds a check for ->rcu_read_unlock_special being negative in rcu_check_callbacks(), thus preventing the RCU_READ_UNLOCK_NEED_QS bit from being set should a scheduling-clock interrupt occur while __rcu_read_unlock() is exiting from an outermost RCU read-side critical section. Of course, __rcu_read_unlock() can be preempted during the time that ->rcu_read_lock_nesting is negative. This could result in the setting of the RCU_READ_UNLOCK_BLOCKED bit after __rcu_read_unlock() checks it, and would also result in this task being queued on the corresponding rcu_node structure's blkd_tasks list. Therefore, some later RCU read-side critical section would enter rcu_read_unlock_special() to clean up -- which could result in deadlock if that critical section happened to be in the scheduler where the runqueue or priority-inheritance locks were held. This situation is dealt with by making rcu_preempt_note_context_switch() check for negative ->rcu_read_lock_nesting, thus refraining from queuing the task (and from setting RCU_READ_UNLOCK_BLOCKED) if we are already exiting from the outermost RCU read-side critical section (in other words, we really are no longer actually in that RCU read-side critical section). In addition, rcu_preempt_note_context_switch() invokes rcu_read_unlock_special() to carry out the cleanup in this case, which clears out the ->rcu_read_unlock_special bits and dequeues the task (if necessary), in turn avoiding needless delay of the current RCU grace period and needless RCU priority boosting. It is still illegal to call rcu_read_unlock() while holding a scheduler lock if the prior RCU read-side critical section has ever had either preemption or irqs enabled. However, the common use case is legal, namely where the entire RCU read-side critical section executes with irqs disabled, for example, when the scheduler lock is held across the entire lifetime of the RCU read-side critical section. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-07-20 10:50:11 -07:00
Peter Zijlstra	d110235d2c	sched: Avoid creating superfluous NUMA domains on non-NUMA systems When creating sched_domains, stop when we've covered the entire target span instead of continuing to create domains, only to later find they're redundant and throw them away again. This keeps single node systems from touching funny NUMA sched_domain creation code and reduces the risks of the new SD_OVERLAP code. Requested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Anton Blanchard <anton@samba.org> Cc: mahesh@linux.vnet.ibm.com Cc: benh@kernel.crashing.org Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1311180177.29152.57.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-20 18:54:33 +02:00
Peter Zijlstra	e3589f6c81	sched: Allow for overlapping sched_domain spans Allow for sched_domain spans that overlap by giving such domains their own sched_group list instead of sharing the sched_groups amongst each-other. This is needed for machines with more than 16 nodes, because sched_domain_node_span() will generate a node mask from the 16 nearest nodes without regard if these masks have any overlap. Currently sched_domains have a sched_group that maps to their child sched_domain span, and since there is no overlap we share the sched_group between the sched_domains of the various CPUs. If however there is overlap, we would need to link the sched_group list in different ways for each cpu, and hence sharing isn't possible. In order to solve this, allocate private sched_groups for each CPU's sched_domain but have the sched_groups share a sched_group_power structure such that we can uniquely track the power. Reported-and-tested-by: Anton Blanchard <anton@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-08bxqw9wis3qti9u5inifh3y@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-20 18:32:41 +02:00
Peter Zijlstra	9c3f75cbd1	sched: Break out cpu_power from the sched_group structure In order to prepare for non-unique sched_groups per domain, we need to carry the cpu_power elsewhere, so put a level of indirection in. Reported-and-tested-by: Anton Blanchard <anton@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-20 18:32:40 +02:00
Al Viro	6657719390	make sure that nsproxy_cache is initialized early enough Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:07 -04:00
Al Viro	3bfa784a65	kill file_permission() completely convert the last remaining caller to inode_permission() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:11 -04:00
Paul E. McKenney	be0e1e21ef	rcu: Streamline code produced by __rcu_read_unlock() Given some common flag combinations, particularly -Os, gcc will inline rcu_read_unlock_special() despite its being in an unlikely() clause. Use noinline to prohibit this misoptimization. In addition, move the second barrier() in __rcu_read_unlock() so that it is not on the common-case code path. This will allow the compiler to generate better code for the common-case path through __rcu_read_unlock(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>	2011-07-19 21:38:53 -07:00
Paul E. McKenney	7765be2fec	rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_special The RCU_BOOST commits for TREE_PREEMPT_RCU introduced an other-task write to a new RCU_READ_UNLOCK_BOOSTED bit in the task_struct structure's ->rcu_read_unlock_special field, but, as noted by Steven Rostedt, without correctly synchronizing all accesses to ->rcu_read_unlock_special. This could result in bits in ->rcu_read_unlock_special being spuriously set and cleared due to conflicting accesses, which in turn could result in deadlocks between the rcu_node structure's ->lock and the scheduler's rq and pi locks. These deadlocks would result from RCU incorrectly believing that the just-ended RCU read-side critical section had been preempted and/or boosted. If that RCU read-side critical section was executed with either rq or pi locks held, RCU's ensuing (incorrect) calls to the scheduler would cause the scheduler to attempt to once again acquire the rq and pi locks, resulting in deadlock. More complex deadlock cycles are also possible, involving multiple rq and pi locks as well as locks from multiple rcu_node structures. This commit fixes synchronization by creating ->rcu_boosted field in task_struct that is accessed and modified only when holding the ->lock in the rcu_node structure on which the task is queued (on that rcu_node structure's ->blkd_tasks list). This results in tasks accessing only their own current->rcu_read_unlock_special fields, making unsynchronized access once again legal, and keeping the rcu_read_unlock() fastpath free of atomic instructions and memory barriers. The reason that the rcu_read_unlock() fastpath does not need to access the new current->rcu_boosted field is that this new field cannot be non-zero unless the RCU_READ_UNLOCK_BLOCKED bit is set in the current->rcu_read_unlock_special field. Therefore, rcu_read_unlock() need only test current->rcu_read_unlock_special: if that is zero, then current->rcu_boosted must also be zero. 
This bug does not affect TINY_PREEMPT_RCU because this implementation of RCU accesses current->rcu_read_unlock_special with irqs disabled, thus preventing races on the !SMP systems that TINY_PREEMPT_RCU runs on. Maybe-reported-by: Dave Jones <davej@redhat.com> Maybe-reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-19 21:38:52 -07:00
Paul E. McKenney	131906b006	rcu: decrease rcu_report_exp_rnp coupling with scheduler PREEMPT_RCU read-side critical sections blocking an expedited grace period invoke rcu_report_exp_rnp(). When the last such critical section has completed, rcu_report_exp_rnp() invokes the scheduler to wake up the task that invoked synchronize_rcu_expedited() -- needlessly holding the root rcu_node structure's lock while doing so, thus needlessly providing a way for RCU and the scheduler to deadlock. This commit therefore releases the root rcu_node structure's lock before calling wake_up(). Reported-by: Ed Tomlinson <edt@aei.ca> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-07-19 21:38:51 -07:00
Peter Foley	e78e8f2d83	kernel: prevent unnecessary rebuilding due to config_data.gz When IKCONFIG is built-in make oldconfig will cause the kernel to be relinked even if .config didn't change. This happens because of a config_data.gz dependency on .config. This patch changes the if_changed to a filechk so that config_data.h is only rebuilt when the contents have actually changed. Signed-off-by: Peter Foley <pefoley2@verizon.net> Signed-off-by: Michal Marek <mmarek@suse.cz>	2011-07-20 01:32:32 +02:00
Vladimir Zapolskiy	f701e5b73a	connector: add an event for monitoring process tracers This change adds a procfs connector event, which is emitted on every successful process tracer attach or detach. If a process connects to another one, the kernelspace connector reports the process id and thread group id of both processes involved. On disconnection a null process id is returned. Such an event makes it possible to create a simple automated userspace mechanism that is aware of processes connecting to others, so that predefined process policies can be applied to them if needed. Note that a detach event is emitted only if the tracer process explicitly executes a PTRACE_DETACH request. In other cases, such as tracee or tracer exit, the detach event is not reported by the proc connector. Signed-off-by: Vladimir Zapolskiy <vzapolskiy@gmail.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-07-18 21:38:33 +02:00
Oleg Nesterov	dcace06cc2	ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task() If the new child is traced, do_fork() adds the pending SIGSTOP. It assumes that either it is traced because of auto-attach or the tracer attached later, in both cases sigaddset/set_thread_flag is correct even if SIGSTOP is already pending. Now that we have PTRACE_SEIZE this is no longer right in the latter case. If the tracer does PTRACE_SEIZE after copy_process() makes the child visible the queued SIGSTOP is wrong. We could check PT_SEIZED bit and change ptrace_attach() to set both PT_PTRACED and PT_SEIZED bits simultaneously but see the next patch, we need to know whether this child was auto-attached or not anyway. So this patch simply moves this code to ptrace_init_task(), this way we can never race with ptrace_attach(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-07-17 20:23:51 +02:00
Oleg Nesterov	961c4675c7	has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/ has_stopped_jobs() naively checks task_is_stopped(group_leader). This was always wrong even without ptrace, group_leader can be dead. And given that ptrace can change the state to TRACED this is wrong even in the single-threaded case. Change the code to check SIGNAL_STOP_STOPPED and simplify the code, retval + break/continue doesn't make this trivial code more readable. We could probably add the usual "|| signal->group_stop_count" check but I don't think this makes sense, the task can start the group-stop right after the check anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-07-17 20:23:50 +02:00
Rafael J. Wysocki	ba1389d74f	Merge branch 'pm-domains' into for-linus * pm-domains: (33 commits) ARM / shmobile: Return -EBUSY from A4LC power off if A3RV is active PM / Domains: Take .power_off() error code into account ARM / shmobile: Use genpd_queue_power_off_work() ARM / shmobile: Use pm_genpd_poweroff_unused() PM / Domains: Introduce function to power off all unused PM domains PM / Domains: Queue up power off work only if it is not pending PM / Domains: Improve handling of wakeup devices during system suspend PM / Domains: Do not restore all devices on power off error PM / Domains: Allow callbacks to execute all runtime PM helpers PM / Domains: Do not execute device callbacks under locks PM / Domains: Make failing pm_genpd_prepare() clean up properly PM / Domains: Set device state to "active" during system resume ARM: mach-shmobile: sh7372 A3RV requires A4LC PM / Domains: Export pm_genpd_poweron() in header ARM: mach-shmobile: sh7372 late pm domain off ARM: mach-shmobile: Runtime PM late init callback ARM: mach-shmobile: sh7372 D4 support ARM: mach-shmobile: sh7372 A4MP support ARM: mach-shmobile: sh7372: make sure that fsi is peripheral of spu2 ARM: mach-shmobile: sh7372 A3SG support ...	2011-07-15 23:59:09 +02:00
Akinobu Mita	f0c077a8b7	PM: Improve error code of pm_notifier_call_chain() This enables pm_notifier_call_chain() to get the actual error code in the callback rather than always assuming -EINVAL, by converting all PM notifier calls to return the error code encapsulated with notifier_from_errno(). Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-07-15 23:58:20 +02:00
Kevin Hilman	a5e4fd8783	PM / Suspend: Export suspend_set_ops, suspend_valid_only_mem Some platforms wish to implement their PM core suspend code as modules. To do so, these functions need to be exported to modules. [rjw: Replaced EXPORT_SYMBOL with EXPORT_SYMBOL_GPL] Reported-by: Jean Pihet <j-pihet@ti.com> Signed-off-by: Kevin Hilman <khilman@ti.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-07-15 23:58:19 +02:00
MyungJoo Ham	3b5fe85252	PM / Suspend: Add .suspend_again() callback to suspend_ops A system or a device may need to control suspend/wakeup events. It may want to wake up the system after a predefined amount of time or at a predefined event decided while entering suspend for polling or delayed work. Then, it may want to enter suspend again if its predefined wakeup condition is the only wakeup reason and there are no outstanding events; thus, it does not unnecessarily wake up userspace or unnecessary devices and stays suspended as long as possible (saving power). Enabling a system to wake up after a specified time can be easily achieved by using RTC. However, to enter suspend again immediately without invoking userland and unrelated devices, we need additional features in the suspend framework. Such need comes from: 1. Monitoring a critical device status without interrupts that can wake up the system. (in-suspend polling) An example is ambient temperature monitoring that needs to shut down the system or a specific device function if it is too hot or cold. The temperature of a specific device may need to be monitored as well; e.g., a charger monitors battery temperature in order to stop charging if overheated. 2. Execute critical "delayed work" at suspend. A driver or a system/board may have a delayed work (or any similar things) that it wants to execute at the requested time. For example, some chargers want to check the battery voltage some time (e.g., 30 seconds) after the battery is fully charged and the charger has stopped. Then, the charger restarts charging if the voltage has dropped more than a threshold, which is smaller than "restart-charger" voltage, which is a threshold to restart charging regardless of the time passed. This patch allows adding a "suspend_again" callback to struct platform_suspend_ops and lets the "suspend_again" callback return true if the system is required to enter suspend again after the current instance of wakeup.
Device-wise suspend_again implemented at dev_pm_ops or syscore is not done because: a) suspend_again feature is usually under platform-wise decision and controls the behavior of the whole platform and b) There are very limited devices related to the use cases of suspend_again; chargers and temperature sensors are mentioned so far. With suspend_again callback registered at struct platform_suspend_ops suspend_ops in kernel/power/suspend.c with suspend_set_ops by the platform, the suspend framework tries to enter suspend again by looping suspend_enter() if suspend_again has returned true and there have been no errors in the suspending sequence or pending wakeups (by pm_wakeup_pending). Tested at Exynos4-NURI. [rjw: Fixed up kerneldoc comment for suspend_enter().] Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-07-15 23:58:19 +02:00
Masami Hiramatsu	7f6878a3d7	tracing/kprobe: Update symbol reference when loading module Since the address of a module-local variable can only be resolved after the target module is loaded, the symbol fetch-argument should be updated when loading the target module. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Link: http://lkml.kernel.org/r/20110627072703.6528.75042.stgit@fedora15 Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-15 15:45:32 -04:00
Masami Hiramatsu	6142431810	tracing/kprobes: Support module init function probing To support probing module init functions, kprobe-tracer allows the user to define a probe on a nonexistent function when it is given with a module name. This also enables the user to set a probe on a function in a specific module, even if a different function with the same name is locally defined in another module. The module name must precede the function name and be separated from it by a ':'. e.g. btrfs:btrfs_init_sysfs Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Link: http://lkml.kernel.org/r/20110627072656.6528.89970.stgit@fedora15 Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-15 15:17:14 -04:00
Masami Hiramatsu	bc81d48d13	kprobes: Return -ENOENT if probe point doesn't exist Return -ENOENT if probe point doesn't exist, but still returns -EINVAL if both of kprobe->addr and kprobe->symbol_name are specified or both are not specified. Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Link: http://lkml.kernel.org/r/20110627072650.6528.67329.stgit@fedora15 Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-15 15:11:47 -04:00
Masami Hiramatsu	1538f888f1	tracing/kprobes: Merge trace probe enable/disable functions Merge redundant enable/disable functions into enable_trace_probe() and disable_trace_probe(). Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: yrl.pp-manager.tt@hitachi.com Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Link: http://lkml.kernel.org/r/20110627072644.6528.26910.stgit@fedora15 [ converted kprobe selftest to use enable_trace_probe ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-15 15:10:58 -04:00
Linus Torvalds	df8d6fe9ef	Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu * 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu: rcu: Prevent RCU callbacks from executing before scheduler initialized	2011-07-15 09:54:34 -07:00
Peter Zijlstra	c64be78ffb	sched: Fix 32bit race Commit `3fe1698b7f` ("sched: Deal with non-atomic min_vruntime reads on 32bit") forgot to initialize min_vruntime_copy which could lead to an infinite while loop in task_waking_fair() under some circumstances (early boot, lucky timing). [ This bug was also reported by others that blamed it on the RCU initialization problems ] Reported-and-tested-by: Bruno Wolff III <bruno@wolff.to> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-15 09:54:02 -07:00
Steven Rostedt	f7bc8b61f6	ftrace: Fix regression where ftrace breaks when modules are loaded Enabling function tracer to trace all functions, then load a module and then disable function tracing will cause ftrace to fail. This can also happen by enabling function tracing on the command line: ftrace=function and during boot up, modules are loaded, then you disable function tracing with 'echo nop > current_tracer' you will trigger a bug in ftrace that will shut itself down. The reason is, the new ftrace code keeps ref counts of all ftrace_ops that are registered for tracing. When one or more ftrace_ops are registered, all the records that represent the functions that the ftrace_ops will trace have a ref count incremented. If this ref count is not zero, when the code modification runs, that function will be enabled for tracing. If the ref count is zero, that function will be disabled from tracing. To make sure the accounting was working, FTRACE_WARN_ON()s were added to updating of the ref counts. If the ref count hits its max (> 2^30 ftrace_ops added), or if the ref count goes below zero, a FTRACE_WARN_ON() is triggered which disables all modification of code. Since it is common for ftrace_ops to trace all functions in the kernel, instead of creating > 20,000 hash items for the ftrace_ops, the hash count is just set to zero, and it represents that the ftrace_ops is to trace all functions. This is where the issues arise. If you enable function tracing to trace all functions, and then add a module, the module's function records do not get the ref count updated. When the function tracer is disabled, all function records' ref counts are subtracted. Since the modules never had their ref counts incremented, they go below zero and the FTRACE_WARN_ON() is triggered. The solution to this is rather simple. When modules are loaded, and their functions are added to the ftrace pool, look to see if any ftrace_ops are registered that trace all functions. And for those, update the ref count for the module function records. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-14 23:02:27 -04:00
Masami Hiramatsu	7143f168e2	tracing/kprobes: Rename probe_* to trace_probe_* Rename probe_* to trace_probe_* to avoid namespace conflicts. This also fixes the improper names find_probe_event() and cleanup_all_probes(), renaming them to find_trace_probe() and release_all_trace_probes() respectively. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20110627072636.6528.60374.stgit@fedora15 Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-14 17:44:43 -04:00
Cyrill Gorcunov	f912987097	perf, x86: P4 PMU - Introduce event alias feature Instead of the hw_nmi_watchdog_set_attr() weak function and the corresponding x86_pmu::hw_watchdog_set_attr() call, we introduce an event alias mechanism which allows us to drop these routines completely and isolate the quirks of the Netburst architecture inside the P4 PMU code only. The main idea remains the same though -- to allow nmi-watchdog and perf top to run simultaneously. Note the aliasing mechanism applies to the generic PERF_COUNT_HW_CPU_CYCLES event only, because an arbitrary event (say passed as RAW initially) might have some additional bits set inside the ESCR register changing the behaviour of the event, and we can't guarantee anymore that the alias event will give the same result. P.S. Huge thanks to Don and Steven for testing and early review. Acked-by: Don Zickus <dzickus@redhat.com> Tested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> CC: Ingo Molnar <mingo@elte.hu> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Stephane Eranian <eranian@google.com> CC: Lin Ming <ming.m.lin@intel.com> CC: Arnaldo Carvalho de Melo <acme@redhat.com> CC: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20110708201712.GS23657@sun Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-14 17:25:04 -04:00
Steven Rostedt	4a9bd3f134	tracing: Have dynamic size event stack traces Currently the stack trace per event in ftrace is only 8 frames. This can be quite limiting and sometimes useless. Especially when the "ignore frames" is wrong and we also use up stack frames for the event processing itself. Change this to be dynamic by adding a percpu buffer that we can write a large stack frame into and then copy into the ring buffer. Interrupts and NMIs that come in while another event is being processed will only get to use the 8 frame stack. That should be enough as the task that it interrupted will have the full stack frame anyway. Requested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-14 16:36:53 -04:00
Glauber Costa	095c0aa83e	sched: adjust scheduler cpu power for stolen time This patch makes update_rq_clock() aware of steal time. The mechanism of operation is not different from irq_time, and follows the same principles. This lives in a CONFIG option itself, and can be compiled out independently of the rest of steal time reporting. The effect of disabling it is that the scheduler will still report steal time (that cannot be disabled), but won't use this information for cpu power adjustments. Every time update_rq_clock_task() is invoked, we query information about how much time was stolen since the last call, and feed it into sched_rt_avg_update(). Although steal time reporting in account_process_tick() keeps track of the last time we read the steal clock, in prev_steal_time, this patch does it independently using another field, prev_steal_time_rq. This is because otherwise, information about time accounted in update_process_tick() would never reach us in update_rq_clock(). Signed-off-by: Glauber Costa <glommer@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Eric B Munson <emunson@mgebm.net> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-14 12:59:47 +03:00
Glauber Costa	e6e6685acc	KVM guest: Steal time accounting This patch accounts steal time in account_process_tick. If one or more ticks are considered stolen in the current accounting cycle, user/system accounting is skipped. Idle is fine, since the hypervisor does not report steal time if the guest is halted. Accounting steal time from the core scheduler gives us the advantage of direct access to the runqueue data. In a later opportunity, it can be used to tweak cpu power and make the scheduler aware of the time it lost. [avi: <asm/paravirt.h> doesn't exist on many archs] Signed-off-by: Glauber Costa <glommer@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Eric B Munson <emunson@mgebm.net> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-14 12:59:46 +03:00
Glauber Costa	c9aaa8957f	KVM: Steal time implementation To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM, while the vcpu had meaningful work to do - halt time does not count. This information is acquired through the run_delay field of delayacct/schedstats infrastructure, that counts time spent in a runqueue but not running. Steal time is per-cpu information, so the traditional MSR-based infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the memory area address containing information about steal time. This patch contains the hypervisor part of the steal time infrastructure, and can be backported independently of the guest portion. [avi, yongjie: export delayacct_on, to avoid build failures in some configs] Signed-off-by: Glauber Costa <glommer@redhat.com> Tested-by: Eric B Munson <emunson@mgebm.net> CC: Rik van Riel <riel@redhat.com> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Peter Zijlstra <peterz@infradead.org> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Yongjie Ren <yongjie.ren@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-14 12:59:14 +03:00
Steven Rostedt	6331c28c96	ftrace: Fix dynamic selftest failure on some archs Archs that do not implement CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST, will fail the dynamic ftrace selftest. The function tracer has a quick 'off' variable that will prevent the call back functions from being called. This variable is called function_trace_stop. In x86, this is implemented directly in the mcount assembly, but for other archs, an intermediate function is used called ftrace_test_stop_func(). In dynamic ftrace, the function pointer variable ftrace_trace_function is used to update the caller code in the mcount caller. But for archs that do not have CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST set, it only calls ftrace_test_stop_func() instead, which in turn calls __ftrace_trace_function. When more than one ftrace_ops is registered, the function it calls is ftrace_ops_list_func(), which will iterate over all registered ftrace_ops and call the callbacks that have their hash matching. The issue happens when two ftrace_ops are registered for different functions and one is then unregistered. The __ftrace_trace_function is then pointed to the remaining ftrace_ops callback function directly. This means it will be called for all functions that were registered to trace by both ftrace_ops that were registered. This is not an issue for archs with CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST, because the update of ftrace_trace_function doesn't happen until after all functions have been updated, and then the mcount caller is updated. But for those archs that do use the ftrace_test_stop_func(), the update is immediate. The dynamic selftest fails because it hits this situation, and the ftrace_ops that it registers fails to only trace what it was supposed to and instead traces all other functions. The solution is to delay the setting of __ftrace_trace_function until after all the functions have been updated according to the registered ftrace_ops.
Also, function_trace_stop is set during the update to prevent function tracing from calling code that is caused by the function tracer itself. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-13 22:25:09 -04:00
Steven Rostedt	072126f452	ftrace: Update filter when tracing enabled in set_ftrace_filter() Currently, if set_ftrace_filter() is called when the ftrace_ops is active, the function filters will not be updated. They will only be updated when tracing is disabled and re-enabled. Update the functions immediately during set_ftrace_filter(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-13 22:10:05 -04:00
Steven Rostedt	41fb61c2d0	ftrace: Balance records when updating the hash Whenever the hash of the ftrace_ops is updated, the record counts must be balanced. This requires disabling the records that are set in the original hash, and then enabling the records that are set in the updated hash. Moving the update into ftrace_hash_move() removes the bug where the hash was updated but the records were not, which results in ftrace triggering a warning and disabling itself because the ftrace_ops filter is updated while the ftrace_ops was registered, and then the failure happens when the ftrace_ops is unregistered. The current code will not trigger this bug, but new code will. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-13 22:00:50 -04:00
Tejun Heo	1f5026a7e2	memblock: Kill MEMBLOCK_ERROR `25818f0f28` (memblock: Make MEMBLOCK_ERROR be 0) thankfully made MEMBLOCK_ERROR 0 and there is already code that expects the error return to be 0. There's no point in keeping MEMBLOCK_ERROR around. End its misery. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1310457490-3356-6-git-send-email-tj@kernel.org Cc: Yinghai Lu <yinghai@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-13 16:36:01 -07:00
Paul E. McKenney	b0d304172f	rcu: Prevent RCU callbacks from executing before scheduler initialized Under some rare but real combinations of configuration parameters, RCU callbacks are posted during early boot that use kernel facilities that are not yet initialized. Therefore, when these callbacks are invoked, hard hangs and crashes ensue. This commit therefore prevents RCU callbacks from being invoked until after the scheduler is fully up and running, as in after multiple tasks have been spawned. It might well turn out that a better approach is to identify the specific RCU callbacks that are causing this problem, but that discussion will wait until such time as someone really needs an RCU callback to be invoked (as opposed to merely registered) during early boot. Reported-by: julie Sullivan <kernelmail.jms@gmail.com> Reported-by: RKK <kulkarni.ravi4@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: julie Sullivan <kernelmail.jms@gmail.com> Tested-by: RKK <kulkarni.ravi4@gmail.com>	2011-07-13 08:17:56 -07:00
Linus Torvalds	d93a881dd7	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc: pcmcia: pxa2xx/vpac270: free gpios on exist rather than requesting ARM: pxa/raumfeld: fix device name for codec ak4104 ARM: pxa/raumfeld: display initialisation fixes ARM: pxa/raumfeld: adapt to upcoming hardware change ARM: pxa: fix gpio_to_chip() clash with gpiolib namespace genirq: replace irq_gc_ack() with {set,clr}_bit variants (fwd) arm: mach-vt8500: add forgotten irq_data conversion ARM: pxa168: correct nand pmu setting ARM: pxa910: correct nand pmu setting ARM: pxa: fix PGSR register address calculation	2011-07-12 14:19:51 -07:00
Alexander Graf	1dda606c5f	KVM: Add compat ioctl for KVM_SET_SIGNAL_MASK KVM has an ioctl to define which signal mask should be used while running inside VCPU_RUN. At least for big endian systems, this mask is different on 32-bit and 64-bit systems (though the size is identical). Add a compat wrapper that converts the mask to whatever the kernel accepts, allowing 32-bit kvm user space to set signal masks. This patch fixes qemu with --enable-io-thread on ppc64 hosts when running 32-bit user land. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:17 +03:00
Justin TerAvest	4aede84b33	fixlet: Remove fs_excl from struct task. fs_excl is a poor man's priority inheritance for filesystems to hint to the block layer that an operation is important. It was never clearly specified, not widely adopted, and will not prevent starvation in many cases (like across cgroups). fs_excl was introduced with the time sliced CFQ IO scheduler, to indicate when a process held FS exclusive resources and thus needed a boost. It doesn't cover all file systems, and it was never fully complete. Let's kill it. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-12 08:35:10 +02:00
Michael Witten	2dc98fd320	doc: Konfig: Documentation/power/{pm => apm-acpi}.txt Signed-off-by: Michael Witten <mfwitten@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-07-11 14:20:07 +02:00
Jiri Kosina	b7e9c223be	Merge branch 'master' into for-next Sync with Linus' tree to be able to apply pending patches that are based on newer code already present upstream.	2011-07-11 14:15:55 +02:00
Michal Hocko	d8bf4ca9ca	rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check Since `ca5ecddf` (rcu: define __rcu address space modifier for sparse) rcu_dereference_check uses rcu_read_lock_held as part of the condition automatically so callers do not have to do that as well. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-07-08 22:21:58 +02:00
Dima Zavin	732375c6a5	plist: Remove the need to supply locks to plist heads This was legacy code brought over from the RT tree and is no longer necessary. Signed-off-by: Dima Zavin <dima@android.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Daniel Walker <dwalker@codeaurora.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Link: http://lkml.kernel.org/r/1310084879-10351-2-git-send-email-dima@android.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-08 14:02:53 +02:00
Steven Rostedt	4376cac667	ftrace: Do not disable interrupts for modules in mcount update When I mounted an NFS directory, it caused several modules to be loaded. At the time I was running the preemptirqsoff tracer, and it showed the following output: # tracer: preemptirqsoff # # preemptirqsoff latency trace v1.1.5 on 2.6.33.9-rt30-mrg-test # -------------------------------------------------------------------- # latency: 1177 us, #4/4, CPU#3 \| (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) # ----------------- # \| task: modprobe-19370 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # => started at: ftrace_module_notify # => ended at: ftrace_module_notify # # # _------=> CPU# # / _-----=> irqs-off # \| / _----=> need-resched # \|\| / _---=> hardirq/softirq # \|\|\| / _--=> preempt-depth # \|\|\|\| /_--=> lock-depth # \|\|\|\|\|/ delay # cmd pid \|\|\|\|\|\| time \| caller # \ / \|\|\|\|\|\| \ \| / modprobe-19370 3d.... 0us!: ftrace_process_locs <-ftrace_module_notify modprobe-19370 3d.... 1176us : ftrace_process_locs <-ftrace_module_notify modprobe-19370 3d.... 1178us : trace_hardirqs_on <-ftrace_module_notify modprobe-19370 3d.... 1178us : <stack trace> => ftrace_process_locs => ftrace_module_notify => notifier_call_chain => __blocking_notifier_call_chain => blocking_notifier_call_chain => sys_init_module => system_call_fastpath That's over 1ms that interrupts are disabled on a Real-Time kernel! Looking at the cause (being the ftrace author helped), I found that the interrupts are disabled before the code modification of mcounts into nops. The interrupts only need to be disabled on start up around this code, not when modules are being loaded. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-07 22:39:38 -04:00
Steven Rostedt	e4a3f541f0	tracing: Still trace filtered irq functions when irq trace is disabled If a function is set to be traced by the set_graph_function, but the option funcgraph-irqs is zero, and the traced function happens to be called from an interrupt, it will not be traced. The point of funcgraph-irqs is to not trace interrupts when we are preempted by an irq, not to not trace functions we want to trace that happen to be in an irq. Luckily the current->trace_recursion element is perfect to add a flag to help us be able to trace functions within an interrupt even when we are not tracing interrupts that preempt the trace. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-07 22:26:27 -04:00
Linus Torvalds	31cb852809	Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM / Hibernate: Fix free_unnecessary_pages()	2011-07-07 13:22:41 -07:00
Linus Torvalds	27a3b735b7	Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: debugobjects: Fix boot crash when kmemleak and debugobjects enabled * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: jump_label: Fix jump_label update for modules oprofile, x86: Fix race in nmi handler while starting counters * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Disable (revert) SCHED_LOAD_SCALE increase sched, cgroups: Fix MIN_SHARES on 64-bit boxen	2011-07-07 13:17:45 -07:00
Simon Guinot	659fb32d1b	genirq: replace irq_gc_ack() with {set,clr}_bit variants (fwd) This fixes a regression introduced by `e59347a` "arm: orion: Use generic irq chip". Depending on the device, interrupt acknowledgement is done by setting or by clearing a dedicated register. Replacing irq_gc_ack() with {set,clr}_bit variants allows both cases to be handled. Note that this patch affects the following SoCs: Davinci, Samsung and Orion. Except for the last, the change is minor: irq_gc_ack() is just renamed to irq_gc_ack_set_bit(). For the Orion SoCs, edge GPIO interrupt support is currently broken. irq_gc_ack() tries to acknowledge such an interrupt by setting the corresponding cause register bit. The Orion GPIO device expects the opposite. To fix this issue, the irq_gc_ack_clr_bit() variant is used. Tested on Network Space v2. Reported-by: Joey Oravec <joravec@drewtech.com> Signed-off-by: Simon Guinot <sguinot@lacie.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2011-07-07 16:02:26 +00:00
Steven Rostedt	43dd61c9a0	ftrace: Fix regression of :mod:module function enabling The new code that allows different utilities to pick and choose what functions they trace broke the :mod: hook that allows users to trace only functions of a particular module. The reason is that the :mod: hook bypasses the hash that is setup to allow individual users to trace their own functions and uses the global hash directly. But if the global hash has not been set up, it will cause a bug: echo ':mod:radeon' > /sys/kernel/debug/set_ftrace_filter produces: [drm:drm_mode_getfb] ERROR* invalid framebuffer id [drm:radeon_crtc_page_flip] ERROR failed to reserve new rbo buffer before flip BUG: unable to handle kernel paging request at ffffffff8160ec90 IP: [<ffffffff810d9136>] add_hash_entry+0x66/0xd0 PGD 1a05067 PUD 1a09063 PMD 80000000016001e1 Oops: 0003 [#1] SMP Jul 7 04:02:28 phyllis kernel: [55303.858604] CPU 1 Modules linked in: cryptd aes_x86_64 aes_generic binfmt_misc rfcomm bnep ip6table_filter hid radeon r8169 ahci libahci mii ttm drm_kms_helper drm video i2c_algo_bit intel_agp intel_gtt Pid: 10344, comm: bash Tainted: G WC 3.0.0-rc5 #1 Dell Inc. 
Inspiron N5010/0YXXJJ RIP: 0010:[<ffffffff810d9136>] [<ffffffff810d9136>] add_hash_entry+0x66/0xd0 RSP: 0018:ffff88003a96bda8 EFLAGS: 00010246 RAX: ffff8801301735c0 RBX: ffffffff8160ec80 RCX: 0000000000306ee0 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880137c92940 RBP: ffff88003a96bdb8 R08: ffff880137c95680 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff81c9df78 R13: ffff8801153d1000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f329c18a700(0000) GS:ffff880137c80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff8160ec90 CR3: 000000003002b000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process bash (pid: 10344, threadinfo ffff88003a96a000, task ffff88012fcfc470) Stack: 0000000000000fd0 00000000000000fc ffff88003a96be38 ffffffff810d92f5 ffff88011c4c4e00 ffff880000000000 000000000b69f4d0 ffffffff8160ec80 ffff8800300e6f06 0000000081130295 0000000000000282 ffff8800300e6f00 Call Trace: [<ffffffff810d92f5>] match_records+0x155/0x1b0 [<ffffffff810d940c>] ftrace_mod_callback+0xbc/0x100 [<ffffffff810dafdf>] ftrace_regex_write+0x16f/0x210 [<ffffffff810db09f>] ftrace_filter_write+0xf/0x20 [<ffffffff81166e48>] vfs_write+0xc8/0x190 [<ffffffff81167001>] sys_write+0x51/0x90 [<ffffffff815c7e02>] system_call_fastpath+0x16/0x1b Code: 48 8b 33 31 d2 48 85 f6 75 33 49 89 d4 4c 03 63 08 49 8b 14 24 48 85 d2 48 89 10 74 04 48 89 42 08 49 89 04 24 4c 89 60 08 31 d2 RIP [<ffffffff810d9136>] add_hash_entry+0x66/0xd0 RSP <ffff88003a96bda8> CR2: ffffffff8160ec90 ---[ end trace a5d031828efdd88e ]--- Reported-by: Brian Marete <marete@toshnix.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-07 11:30:08 -04:00
Steven Rostedt	40ee4dffff	tracing: Have "enable" file use refcounts like the "filter" file The "enable" file for the event system can be removed when a module is unloaded and the event system only has events from that module. As the event system nr_events count goes to zero, it may be freed if its ref_count is also set to zero. Like the "filter" file, the "enable" file may be opened by a task and referenced later, after a module has been unloaded and the events for that event system have been removed. Although the "filter" file referenced the event system structure, the "enable" file only references a pointer to the event system name. Since the name is freed when the event system is removed, it is possible that an access to the "enable" file may reference a freed pointer. Update the "enable" file to use the subsystem_open() routine that the "filter" file uses, to keep a reference to the event system structure while the "enable" file is opened. Cc: <stable@kernel.org> Reported-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-07 11:22:29 -04:00
Steven Rostedt	e9dbfae53e	tracing: Fix bug when reading system filters on module removal The event system is freed when its nr_events is set to zero. This happens when a module created an event system and then later the module is removed. Modules may share systems, so the system is allocated when it is created and freed when the modules are unloaded and all the events under the system are removed (nr_events set to zero). The problem arises when a task opened the "filter" file for the system. If the module is unloaded and it removed the last event for that system, the system structure is freed. If the task that opened the filter file accesses the "filter" file after the system has been freed, the system will access an invalid pointer. By adding a ref_count, and using it to keep track of what is using the event system, we can free it after all users are finished with the event system. Cc: <stable@kernel.org> Reported-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-07 11:19:18 -04:00
Rafael J. Wysocki	4d4cf23cdd	PM / Hibernate: Fix free_unnecessary_pages() There is a bug in free_unnecessary_pages() that causes it to attempt to free too many pages in some cases, which triggers the BUG_ON() in memory_bm_clear_bit() for copy_bm. Namely, if count_data_pages() is initially greater than alloc_normal, we get to_free_normal equal to 0 and "save" greater than 0. In that case, if the sum of "save" and count_highmem_pages() is greater than alloc_highmem, we subtract a positive number from to_free_normal. Hence, since to_free_normal was 0 before the subtraction and is an unsigned int, the result is converted to a huge positive number that is used as the number of pages to free. Fix this bug by checking if to_free_normal is actually greater than or equal to the number we're going to subtract from it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-and-tested-by: Matthew Garrett <mjg@redhat.com> Cc: stable@kernel.org	2011-07-06 20:15:23 +02:00
Ram Pai	23c570a674	resource: ability to resize an allocated resource Provides the ability to resize a resource that is already allocated. This functionality is put in place to support reallocation needs of pci resources. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-06 10:54:08 -07:00
Ingo Molnar	931da6137e	Merge branch 'tip/perf/core-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2011-07-05 11:55:43 +02:00
Rafael J. Wysocki	b7b95920aa	PM: Allow the clocks management code to be used during system suspend The common clocks management code in drivers/base/power/clock_ops.c is going to be used during system-wide power transitions as well as for runtime PM, so it shouldn't depend on CONFIG_PM_RUNTIME. However, the suspend/resume functions provided by it for CONFIG_PM_RUNTIME unset, to be used during system-wide power transitions, should not behave in the same way as their counterparts defined for CONFIG_PM_RUNTIME set, because in that case the clocks are managed differently at run time. The names of the functions still contain the word "runtime" after this change, but that is going to be modified by a separate patch later. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reviewed-by: Kevin Hilman <khilman@ti.com>	2011-07-02 14:29:56 +02:00
Rafael J. Wysocki	f721889ff6	PM / Domains: Support for generic I/O PM domains (v8) Introduce common headers, helper functions and callbacks allowing platforms to use simple generic power domains for runtime power management. Introduce struct generic_pm_domain to be used for representing power domains that each contain a number of devices and may be parent domains or subdomains with respect to other power domains. Among other things, this structure includes callbacks to be provided by platforms for performing specific tasks related to power management (e.g. ->stop_device() may disable a device's clocks, while ->start_device() may enable them, ->power_off() is supposed to remove power from the entire power domain and ->power_on() is supposed to restore it). Introduce functions that can be used as power domain runtime PM callbacks, pm_genpd_runtime_suspend() and pm_genpd_runtime_resume(), as well as helper functions for the initialization of a power domain represented by a struct generic_pm_domain object, adding a device to or removing a device from it and adding or removing subdomains. Introduce configuration option CONFIG_PM_GENERIC_DOMAINS to be selected by the platforms that want to use the new code. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Reviewed-by: Kevin Hilman <khilman@ti.com>	2011-07-02 14:29:55 +02:00
Ingo Molnar	1ecc818c51	Merge branch 'sched/core-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into sched/core	2011-07-01 13:20:51 +02:00
Avi Kivity	26ca5c11fb	perf: export perf_event_refresh() to modules KVM needs one-shot samples, since a PMC programmed to -X will fire after X events and then again after 2^40 events (i.e. variable period). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1309362157-6596-4-git-send-email-avi@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:40 +02:00
Avi Kivity	4dc0da8696	perf: Add context field to perf_event The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event->overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1309362157-6596-2-git-send-email-avi@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:38 +02:00
Peter Zijlstra	a7ac67ea02	perf: Remove the perf_output_begin(.sample) argument Since only samples call perf_output_sample() it's much saner (and more correct) to put the sample logic in there than in the perf_output_begin()/perf_output_end() pair. Saves a useless argument, reduces conditionals and shrinks struct perf_output_handle, win! Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-2crpvsx3cqu67q3zqjbnlpsc@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:35 +02:00
Peter Zijlstra	a8b0ca17b8	perf: Remove the nmi parameter from the swevent and overflow interface The nmi parameter indicated if we could do wakeups from the current context, if not, we would set some state and self-IPI and let the resulting interrupt do the wakeup. For the various event classes: - hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from the PMI-tail (ARM etc.) - tracepoint: nmi=0; since tracepoint could be from NMI context. - software: nmi=[0,1]; some, like the schedule thing cannot perform wakeups, and hence need 0. As one can see, there is very little nmi=1 usage, and the down-side of not using it is that on some platforms some software events can have a jiffy delay in wakeup (when arch_irq_work_raise isn't implemented). The up-side however is that we can remove the nmi parameter and save a bunch of conditionals in fast paths. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Michael Cree <mcree@orcon.net.nz> Cc: Will Deacon <will.deacon@arm.com> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Cc: Anton Blanchard <anton@samba.org> Cc: Eric B Munson <emunson@mgebm.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David S. Miller <davem@davemloft.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Don Zickus <dzickus@redhat.com> Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:35 +02:00
Cyrill Gorcunov	1880c4ae18	perf, x86: Add hw_watchdog_set_attr() in a sake of nmi-watchdog on P4 Due to the restrictions and specifics of the Netburst PMU we need a separate event for the NMI watchdog. In particular every Netburst event consumes not just a counter and a config register, but also an additional ESCR register. Since ESCR registers are grouped upon counters (i.e. if an ESCR is occupied for some event there is no room for another event to enter until it's released) we need to pick up the "least" used ESCR (or the most available one) for nmi-watchdog purposes -- so MSR_P4_CRU_ESCR2/3 was chosen. With this patch nmi-watchdog and perf top should be able to run simultaneously. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> CC: Lin Ming <ming.m.lin@intel.com> CC: Arnaldo Carvalho de Melo <acme@redhat.com> CC: Frederic Weisbecker <fweisbec@gmail.com> Tested-and-reviewed-by: Don Zickus <dzickus@redhat.com> Tested-and-reviewed-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110623124918.GC13050@sun Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:34 +02:00
Eric B Munson	0d6412085b	events: Ensure that timers are updated without requiring read() call The event tracing infrastructure exposes two timers which should be updated each time the value of the counter is updated. Currently, these timers are only updated when userspace calls read() on the fd associated with an event. This means that counters which are read via the mmap'd page exclusively never have their timers updated. This patch ensures that the timers are updated each time the values in the mmap'd page are updated. Signed-off-by: Eric B Munson <emunson@mgebm.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1308932786-5111-1-git-send-email-emunson@mgebm.net Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:34 +02:00
Eric B Munson	c479429591	events: Move lockless timer calculation into helper function Take the timer calculation from perf_output_read and move it to a helper function for any place that needs timer values but cannot take the ctx->lock. Signed-off-by: Eric B Munson <emunson@mgebm.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1308861279-15216-2-git-send-email-emunson@mgebm.net Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:33 +02:00
Eric B Munson	b7526f0ca6	events: Add note to update_event_times comment about holding ctx->lock Signed-off-by: Eric B Munson <emunson@mgebm.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1308861279-15216-1-git-send-email-emunson@mgebm.net Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:33 +02:00
Vince Weaver	4ec8363dfc	perf_events: Fix perf buffer watermark setting Since 2.6.36 (specifically commit `d57e34fdd6` ("perf: Simplify the ring-buffer logic: make perf_buffer_alloc() do everything needed")), the perf_buffer_init_code() has been mis-setting the buffer watermark if perf_event_attr.wakeup_events has a non-zero value. This is because perf_event_attr.wakeup_events is a union with perf_event_attr.wakeup_watermark. This commit re-enables the check for perf_event_attr.watermark being set before continuing with setting a non-default watermark. This bug is most noticeable when you are trying to use PERF_IOC_REFRESH with a value larger than one and perf_event_attr.wakeup_events is set to one. In this case the buffer watermark will be set to 1 and you will get extraneous POLL_IN overflows rather than POLL_HUP as expected. [ avoid using attr.wakeup_events when attr.watermark is set ] Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1106011506390.5384@cl320.eecs.utk.edu Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:32 +02:00
Yong Zhang	1c09ab0d25	sched: Skip autogroup when looking for all rt sched groups Since commit `ec514c48` ("sched: Fix rt_rq runtime leakage bug") 'cat /proc/sched_debug' will print data of root_task_group.rt_rq multiple times. This is because autogroup does not have its own rt group, instead the rt group of autogroup is linked to root_task_group. So skip it when we are looking for all rt sched groups, and it will also save some no-op operations against root_task_group in __disable_runtime()/__enable_runtime(). -v2: Based on Cheng Xu's idea which uses less code. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Cheng Xu <chengxu@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/BANLkTi=87P3RoTF_UEtamNfc_XGxQXE__Q@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 10:39:08 +02:00
Thomas Gleixner	307bf9803f	sched: Simplify mutex_spin_on_owner() It does not make sense to rcu_read_lock/unlock() in every loop iteration while spinning on the mutex. Move the rcu protection outside the loop. Also simplify the return path to always check for lock->owner == NULL which meets the requirements of both owner changed and need_resched() caused loop exits. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1106101458350.11814@ionos Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 10:39:07 +02:00
Nikunj A. Dadhania	2a46dae380	sched: Remove rcu_read_lock() from wake_affine() wake_affine() is only called from one path: select_task_rq_fair(), which already has the RCU read lock held. Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20110607101251.777.34547.stgit@IBM-009124035060.in.ibm.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 10:39:06 +02:00
Ingo Molnar	36b2e922b5	Merge commit 'v3.0-rc5' into sched/core Merge reason: Move to a (much) newer base. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 10:34:24 +02:00
Ingo Molnar	10e6962765	Merge commit 'v3.0-rc5' into perf/core Merge reason: Pick up the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 10:28:46 +02:00
Mike Galbraith	cd62287e36	sched, cgroups: Fix MIN_SHARES on 64-bit boxen Commit `c8b28116` ("sched: Increase SCHED_LOAD_SCALE resolution") intended to have no user-visible effect, but allows setting cpu.shares to < MIN_SHARES, which the user then sees. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nikhil Rao <ncrao@google.com> Link: http://lkml.kernel.org/r/1307192600.8618.3.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 10:25:03 +02:00
Mr Dash Four	131ad62d8f	netfilter: add SELinux context support to AUDIT target In this revision the conversion of secid to SELinux context and adding it to the audit log is moved from xt_AUDIT.c to audit.c with the aid of a separate helper function - audit_log_secctx - which does both the conversion and logging of SELinux context, thus also preventing internal secid number being leaked to userspace. If conversion is not successful an error is raised. With the introduction of this helper function the work done in xt_AUDIT.c is much more simplified. It also opens the possibility of this helper function being used by other modules (including auditd itself), if desired. With this addition, typical (raw auditd) output after applying the patch would be: type=NETFILTER_PKT msg=audit(1305852240.082:31012): action=0 hook=1 len=52 inif=? outif=eth0 saddr=10.1.1.7 daddr=10.1.2.1 ipid=16312 proto=6 sport=56150 dport=22 obj=system_u:object_r:ssh_client_packet_t:s0 type=NETFILTER_PKT msg=audit(1306772064.079:56): action=0 hook=3 len=48 inif=eth0 outif=? smac=00:05:5d:7c:27:0b dmac=00:02:b3:0a:7f:81 macproto=0x0800 saddr=10.1.2.1 daddr=10.1.1.7 ipid=462 proto=6 sport=22 dport=3561 obj=system_u:object_r:ssh_server_packet_t:s0 Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Mr Dash Four <mr.dash.four@googlemail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-06-30 13:31:57 +02:00
James Morris	5b944a71a1	Merge branch 'linus' into next	2011-06-30 18:43:56 +10:00
Xiao Guangrong	140fe3b1ab	jump_label: Fix jump_label update for modules The jump labels entries for modules do not stop at __stop__jump_table, but after mod->jump_entries + mod_num_jump_entries. By checking the wrong end point, module trace events never get enabled. Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Tested-by: Avi Kivity <avi@redhat.com> Tested-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Link: http://lkml.kernel.org/r/4E00038B.2060404@cn.fujitsu.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-29 09:59:17 -04:00
Vasiliy Kulikov	26c4caea9d	taskstats: don't allow duplicate entries in listener mode Currently a single process may register exit handlers an unlimited number of times. It may lead to a bloated listeners chain and very slow process terminations. E.g. after 10KK sent TASKSTATS_CMD_ATTR_REGISTER_CPUMASKs ~300 MB of kernel memory is stolen for the handlers chain and "time id" shows 2-7 seconds instead of the normal 0.003. It makes it possible to exhaust all kernel memory and to eat much of the CPU time by triggering numerous exits on a single CPU. The patch limits the number of times a single process may register itself on a single CPU to one. One little issue is kept unfixed - as taskstats_exit() is called before exit_files() in do_exit(), the orphaned listener entry (if it was not explicitly deregistered) is kept until someone's next exit() and implicit deregistration in send_cpu_listeners(). So, if a process that registered itself as a listener exits and the next spawned process gets the same pid, it would inherit taskstats attributes. Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-27 18:00:13 -07:00
Suresh Siddha	192d885742	x86, mtrr: use stop_machine APIs for doing MTRR rendezvous MTRR rendezvous sequence was not implemented using stop_machine() before, as this gets called both from the process context as well as the cpu online paths (where the cpu has not come online and the interrupts are disabled etc). Now that we have a new stop_machine_from_inactive_cpu() API, use it for rendezvous during mtrr init of a logical processor that is coming online. For the rest (runtime MTRR modification, system boot, resume paths), use stop_machine() to implement the rendezvous sequence. This will consolidate and clean up the code. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/20110623182057.076997177@sbsiddha-MOBL3.sc.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-06-27 15:17:13 -07:00
Tejun Heo	f740e6cd0c	stop_machine: implement stop_machine_from_inactive_cpu() Currently, mtrr wants stop_machine functionality while a CPU is being brought up. As stop_machine() requires the calling CPU to be active, mtrr implements its own stop_machine using stop_one_cpu() on each online CPU. This doesn't only unnecessarily duplicate complex logic but also introduces a possibility of deadlock when it races against the generic stop_machine(). This patch implements stop_machine_from_inactive_cpu() to serve such use cases. Its functionality is basically the same as stop_machine(); however, it should be called from a CPU which isn't active and doesn't depend on working scheduling on the calling CPU. This is achieved by using busy loops for synchronization and open-coding stop_cpus queuing and waiting with direct invocation of fn() for local CPU inbetween. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20110623182056.982526827@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-06-27 15:17:08 -07:00
Tejun Heo	fd7355ba1e	stop_machine: reorganize stop_cpus() implementation Refactor the queuing part of the stop cpus work from __stop_cpus() into queue_stop_cpus_work(). The reorganization is to help future improvements to stop_machine() and doesn't introduce any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20110623182056.897818337@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-06-27 15:17:07 -07:00
Suresh Siddha	6d3321e8e2	x86, mtrr: lock stop machine during MTRR rendezvous sequence MTRR rendezvous sequence using stop_one_cpu_nowait() can potentially happen in parallel with another system wide rendezvous using stop_machine(). This can lead to deadlock (The order in which works are queued can be different on different cpus. Some cpus will be running the first rendezvous handler and others will be running the second rendezvous handler. Each set waits for the other set to join for the system wide rendezvous, leading to a deadlock). MTRR rendezvous sequence is not implemented using stop_machine() as this gets called both from the process context as well as the cpu online paths (where the cpu has not come online and the interrupts are disabled etc). stop_machine() works with only online cpus. For now, take the stop_machine mutex in the MTRR rendezvous sequence that gets called from an online cpu (here we are in the process context and can potentially sleep while taking the mutex). And the MTRR rendezvous that gets triggered during cpu online doesn't need to take this stop_machine lock (as the stop_machine() already ensures that there is no cpu hotplug going on in parallel by doing get_online_cpus()) TBD: Pursue a cleaner solution of extending the stop_machine() infrastructure to handle the case where the calling cpu is still not online and use this for MTRR rendezvous sequence. fixes: https://bugzilla.novell.com/show_bug.cgi?id=672008 Reported-by: Vadim Kotelnikov <vadimuzzz@inbox.ru> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/20110623182056.807230326@sbsiddha-MOBL3.sc.intel.com Cc: stable@kernel.org # 2.6.35+, backport a week or two after this gets more testing in mainline Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-06-27 14:00:46 -07:00
Oleg Nesterov	479bf98c1c	ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/ wait_consider_task() checks same_thread_group(parent, real_parent), this is the open-coded ptrace_reparented(). __ptrace_detach() remains the only function which has to check this by hand, although we could reorganize the code to delay __ptrace_unlink. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-06-27 20:30:11 +02:00
Oleg Nesterov	bb3696da89	ptrace: kill real_parent_is_ptracer() in favor of ptrace_reparented() Kill real_parent_is_ptracer() and update the callers to use ptrace_reparented(), after the previous patch they do the same. Remove the unnecessary ->ptrace != 0 check in get_signal_to_deliver(), if ptrace_reparented() == T then the task must be ptraced. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-06-27 20:30:10 +02:00
Oleg Nesterov	d4f7c511c1	do not change dead_task->exit_signal __ptrace_detach() and do_notify_parent() set task->exit_signal = -1 to mark the task dead. This is no longer needed, nobody checks exit_signal to detect the EXIT_DEAD task. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Tejun Heo <tj@kernel.org>	2011-06-27 20:30:10 +02:00
Oleg Nesterov	e550f14dc6	kill task_detached() Update the last user of task_detached(), wait_task_zombie(), to use thread_group_leader() and kill task_detached(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Tejun Heo <tj@kernel.org>	2011-06-27 20:30:09 +02:00
Oleg Nesterov	0976a03e5c	reparent_leader: check EXIT_DEAD instead of task_detached() Change reparent_leader() to check ->exit_state instead of ->exit_signal, this matches the similar EXIT_DEAD check in wait_consider_task() and allows us to cleanup the do_notify_parent/task_detached logic. task_detached() was really needed during reparenting before `9cd80bbb` "do_wait() optimization: do not place sub-threads on ->children list" to filter out the sub-threads. After this change task_detached(p) can only be true if p is the dead group_leader and its parent ignores SIGCHLD, in this case the caller of do_notify_parent() is going to reap this task and it should set EXIT_DEAD. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Tejun Heo <tj@kernel.org>	2011-06-27 20:30:09 +02:00
Oleg Nesterov	8677347378	make do_notify_parent() __must_check, update the callers Change other callers of do_notify_parent() to check the value it returns, this makes the subsequent task_detached() unnecessary. Mark do_notify_parent() as __must_check. Use thread_group_leader() instead of !task_detached() to check if we need to notify the real parent in wait_task_zombie(). Remove the stale comment in release_task(). "just for sanity" is no longer true, we have to set EXIT_DEAD to avoid the races with do_wait(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-06-27 20:30:09 +02:00
Oleg Nesterov	9843a1e977	__ptrace_detach: avoid task_detached(), check do_notify_parent() __ptrace_detach() relies on the current obscure behaviour of do_notify_parent(tsk) which changes tsk->exit_signal if this child should be silently reaped. That is why we check task_detached(), it is true if the task is sub-thread, or it is the group_leader but its exit_signal was changed by do_notify_parent(). This is confusing, change the code to rely on !thread_group_leader() or the value returned by do_notify_parent(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-06-27 20:30:08 +02:00
Oleg Nesterov	45cdf5cc07	kill tracehook_notify_death() Kill tracehook_notify_death(), reimplement the logic in its caller, exit_notify(). Also, change the exec_id's check to use thread_group_leader() instead of task_detached(), this is clearer. This logic only applies to the exiting leader, a sub-thread must never change its exit_signal. Note: when the traced group leader exits the exit_signal-or-SIGCHLD logic looks really strange: - we notify the tracer even if !thread_group_empty() but do_wait(WEXITED) can't work until all threads exit - if the tracer is real_parent, it is not clear why we can't use ->exit_signal even if !thread_group_empty() -v2: do not try to fix the 2nd oddity to avoid the subtle behavior change mixed with reorganization, suggested by Tejun. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Tejun Heo <tj@kernel.org>	2011-06-27 20:30:08 +02:00
Oleg Nesterov	53c8f9f199	make do_notify_parent() return bool - change do_notify_parent() to return a boolean, true if the task should be reaped because its parent ignores SIGCHLD. - update the only caller which checks the returned value, exit_notify(). This temporarily uglifies exit_notify() even more; it will be cleaned up by the next change. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-06-27 20:30:08 +02:00
Linus Torvalds	8abf558834	Merge branch 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rtc: vt8500: Fix build error & cleanup rtc_class_ops->update_irq_enable() alarmtimers: Return -ENOTSUPP if no RTC device is present alarmtimers: Handle late rtc module loading	2011-06-25 07:23:59 -07:00
Frederic Weisbecker	d902db1eb6	sched: Generalize sleep inside spinlock detection The sleeping inside spinlock detection is actually used for more general sleeping inside atomic sections debugging: preemption disabled, rcu read side critical sections, interrupts, interrupt disabled, etc... Change the name of the config and its help section to reflect its more general role. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu>	2011-06-23 00:44:38 +02:00
Tejun Heo	4b9d33e6d8	ptrace: kill clone/exec tracehooks At this point, tracehooks aren't useful to mainline kernel and mostly just add an extra layer of obfuscation. Although they have comments, without actual in-kernel users, it is difficult to tell what their assumptions are and what they're actually trying to achieve. To mainline kernel, they just aren't worth keeping around. This patch kills the following clone and exec related tracehooks. tracehook_prepare_clone() tracehook_finish_clone() tracehook_report_clone() tracehook_report_clone_complete() tracehook_unsafe_exec() The changes are mostly trivial - logic is moved to the caller and comments are merged and adjusted appropriately. The only exception is in check_unsafe_exec() where LSM_UNSAFE_PTRACE* are OR'd to bprm->unsafe instead of setting it, which produces the same result as the field is always zero on entry. It also tests p->ptrace instead of (p->ptrace & PT_PTRACED) for consistency, which also gives the same result. This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-22 19:26:29 +02:00
Tejun Heo	a288eecce5	ptrace: kill trivial tracehooks At this point, tracehooks aren't useful to mainline kernel and mostly just add an extra layer of obfuscation. Although they have comments, without actual in-kernel users, it is difficult to tell what their assumptions are and what they're actually trying to achieve. To mainline kernel, they just aren't worth keeping around. This patch kills the following trivial tracehooks. * Ones testing whether task is ptraced. Replace with ->ptrace test. tracehook_expect_breakpoints() tracehook_consider_ignored_signal() tracehook_consider_fatal_signal() * ptrace_event() wrappers. Call directly. tracehook_report_exec() tracehook_report_exit() tracehook_report_vfork_done() * ptrace_release_task() wrapper. Call directly. tracehook_finish_release_task() * noop tracehook_prepare_release_task() tracehook_report_death() This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-22 19:26:28 +02:00
Tejun Heo	d21142ece4	ptrace: kill task_ptrace() task_ptrace(task) simply dereferences task->ptrace and isn't even used consistently only adding confusion. Kill it and directly access ->ptrace instead. This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-22 19:26:27 +02:00
Peter Zijlstra	dd4e5d3ac4	lockdep: Fix trace_[soft,hard]irqs_[on,off]() recursion Commit: `1efc5da3cf`: [PATCH] order of lockdep off/on in vprintk() should be changed explains the reason for having raw_local_irq_*() and lockdep_off() in printk(). Instead of working around the broken recursion detection of interrupt state tracking, fix it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: efault@gmx.de Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110621153806.185242734@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-22 11:39:34 +02:00
Peter Zijlstra	4f2a8d3cf5	printk: Fix console_sem vs logbuf_lock unlock race Fix up the fallout from commit `0b5e1c5255` ("printk: Release console_sem after logbuf_lock"). The reason for unlocking the console_sem under the logbuf_lock is that a concurrent printk() might fill up the buffer but fail to acquire the console sem, resulting in a missed write to the console until a subsequent console_sem acquire/release cycle. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: efault@gmx.de Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1308734409.1022.14.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-22 11:39:34 +02:00
John Stultz	cb33217b1b	time: Avoid accumulating time drift in suspend/resume Because the read_persistent_clock interface is usually backed by only a second granular interface, each time we read from the persistent clock for suspend/resume, we introduce a half second (on average) of error. In order to avoid this error accumulating as the system is suspended over and over, this patch measures the time delta between the persistent clock and the system CLOCK_REALTIME. If the delta is less than 2 seconds from the last suspend, we compensate by using the previous time delta (keeping it close). If it is larger than 2 seconds, we assume the clock was set or has been changed, so we do no correction and update the delta. Note: If NTP is running, this could seem to "fight" with the NTP corrected time: if the system time was off by 1 second, and NTP slewed the value in, a suspend/resume cycle could undo this correction by trying to restore the previous offset from the persistent clock. However, without this patch, since each read could cause almost a full second worth of error, it's possible to get almost 2 seconds of error just from the suspend/resume cycle alone, so this is about equal to any offset added by the compensation. Further, on systems that suspend/resume frequently, this should keep time closer than NTP could compensate for if the errors were allowed to accumulate. Credits to Arve Hjønnevåg for suggesting this solution. CC: Arve Hjønnevåg <arve@android.com> CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-06-21 16:55:37 -07:00
John Stultz	cb5de2f8d0	time: Catch invalid timespec sleep values in __timekeeping_inject_sleeptime Arve suggested making sure we catch possible negative sleep time intervals that could be passed into timekeeping_inject_sleeptime. CC: Arve Hjønnevåg <arve@android.com> CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-06-21 16:55:36 -07:00
John Stultz	1c6b39ad3f	alarmtimers: Return -ENOTSUPP if no RTC device is present Toralf Förster and Richard Weinberger noted that if there is no RTC device, the alarm timers core prints out an annoying "ALARM timers will not wake from suspend" message. This warning has been removed in a previous patch, however the issue still remains: The original idea was to support alarm timers even if there was no rtc device, as long as the system didn't go into suspend. However, after further consideration, communicating to the application that alarmtimers are not fully functional seems like the better solution. So this patch makes it so we return -ENOTSUPP to any posix _ALARM clockid calls if there is no backing RTC device on the system. Further, this changes the behavior so that when there is no rtc device we will check for one on clock_getres, clock_gettime, timer_create, and timer_nsleep instead of on suspend. CC: Toralf Förster <toralf.foerster@gmx.de> CC: Richard Weinberger <richard@nod.at> CC: Peter Zijlstra <peterz@infradead.org> CC: Thomas Gleixner <tglx@linutronix.de> Reported-by: Toralf Förster <toralf.foerster@gmx.de> Reported-by: Richard Weinberger <richard@nod.at> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-06-21 16:32:28 -07:00
John Stultz	c008ba58af	alarmtimers: Handle late rtc module loading The alarmtimers code currently picks a rtc device to use at late init time. However, if your rtc driver is loaded as a module, it may be registered after the alarmtimers late init code, leaving the alarmtimers nonfunctional. This patch moves the rtc device selection to when we actually try to use it, allowing us to make use of rtc modules that may have been loaded at any point since bootup. CC: Thomas Gleixner <tglx@linutronix.de> CC: Meelis Roos <mroos@ut.ee> Reported-by: Meelis Roos <mroos@ut.ee> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-06-21 15:38:33 -07:00
Michal Kubecek	8440f4b194	PM: Free memory bitmaps if opening /dev/snapshot fails When opening /dev/snapshot device, snapshot_open() creates memory bitmaps which are freed in snapshot_release(). But if any of the callbacks called by pm_notifier_call_chain() returns NOTIFY_BAD, open() fails, snapshot_release() is never called and bitmaps are not freed. Next attempt to open /dev/snapshot then triggers BUG_ON() check in create_basic_memory_bitmaps(). This happens e.g. when vmwatchdog module is active on s390x. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: stable@kernel.org	2011-06-21 23:20:06 +02:00
Linus Torvalds	8816ead9d8	Merge branches 'perf-urgent-for-linus', 'sched-urgent-for-linus', 'timers-urgent-for-linus' and 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tools/perf: Fix static build of perf tool tracing: Fix regression in printk_formats file * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: generic-ipi: Fix kexec boot crash by initializing call_single_queue before enabling interrupts * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: Make watchdog robust vs. interruption timerfd: Fix wakeup of processes when timer is cancelled on clock change * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, MAINTAINERS: Add x86 MCE people x86, efi: Do not reserve boot services regions within reserved areas	2011-06-19 09:00:18 -07:00
Linus Torvalds	357ed6b1a1	Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: Move RCU_BOOST #ifdefs to header file rcu: use softirq instead of kthreads except when RCU_BOOST=y rcu: Use softirq to address performance regression rcu: Simplify curing of load woes	2011-06-19 08:56:56 -07:00
David Howells	879669961b	KEYS/DNS: Fix ____call_usermodehelper() to not lose the session keyring ____call_usermodehelper() now erases any credentials set by the subprocess_inf::init() function. The problem is that commit `17f60a7da1` ("capabilites: allow the application of capability limits to usermode helpers") creates and commits new credentials with prepare_kernel_cred() after the call to the init() function. This wipes all keyrings after umh_keys_init() is called. The best way to deal with this is to put the init() call just prior to the commit_creds() call, and pass the cred pointer to init(). That means that umh_keys_init() and suchlike can modify the credentials _before_ they are published and potentially in use by the rest of the system. This prevents request_key() from working as it is prevented from passing the session keyring it set up with the authorisation token to /sbin/request-key, and so the latter can't assume the authority to instantiate the key. This causes the in-kernel DNS resolver to fail with ENOKEY unconditionally. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Eric Paris <eparis@redhat.com> Tested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-17 09:40:48 -07:00
Takao Indoh	d8ad7d1123	generic-ipi: Fix kexec boot crash by initializing call_single_queue before enabling interrupts There is a problem that kdump(2nd kernel) sometimes hangs up due to a pending IPI from 1st kernel. Kernel panic occurs because IPI comes before call_single_queue is initialized. To fix the crash, rename init_call_single_data() to call_function_init() and call it in start_kernel() so that call_single_queue can be initialized before enabling interrupts. The details of the crash are: (1) 2nd kernel boots up (2) A pending IPI from 1st kernel comes when irqs are first enabled in start_kernel(). (3) Kernel tries to handle the interrupt, but call_single_queue is not initialized yet at this point. As a result, in the generic_smp_call_function_single_interrupt(), NULL pointer dereference occurs when list_replace_init() tries to access &q->list.next. Therefore this patch changes the name of init_call_single_data() to call_function_init() and calls it before local_irq_enable() in start_kernel(). Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Milton Miller <miltonm@bga.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: kexec@lists.infradead.org Link: http://lkml.kernel.org/r/D6CBEE2F420741indou.takao@jp.fujitsu.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-17 10:17:12 +02:00
Paul E. McKenney	f8b7fc6b51	rcu: Move RCU_BOOST #ifdefs to header file The commit "use softirq instead of kthreads except when RCU_BOOST=y" just applied #ifdef in place. This commit is a cleanup that moves the newly #ifdef'ed code to the header file kernel/rcutree_plugin.h. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-06-16 16:12:05 -07:00
Tejun Heo	544b2c91a9	ptrace: implement PTRACE_LISTEN The previous patch implemented async notification for ptrace but it only worked while tracee is running. This patch introduces PTRACE_LISTEN which is suggested by Oleg Nesterov. It's allowed iff tracee is in STOP trap and puts tracee into quasi-running state - tracee never really runs but wait(2) and ptrace(2) consider it to be running. While ptracer is listening, tracee is allowed to re-enter STOP to notify an async event. Listening state is cleared on the first notification. Ptracer can also clear it by issuing INTERRUPT - tracee will re-trap into STOP with listening state cleared. This allows ptracer to monitor group stop state without running tracee - use INTERRUPT to put tracee into STOP trap, issue LISTEN and then wait(2) to wait for the next group stop event. When it happens, PTRACE_GETSIGINFO provides information to determine the current state. Test program follows. #define PTRACE_SEIZE 0x4206 #define PTRACE_INTERRUPT 0x4207 #define PTRACE_LISTEN 0x4208 #define PTRACE_SEIZE_DEVEL 0x80000000 static const struct timespec ts1s = { .tv_sec = 1 }; int main(int argc, char **argv) { pid_t tracee, tracer; int i; tracee = fork(); if (!tracee) while (1) pause(); tracer = fork(); if (!tracer) { siginfo_t si; ptrace(PTRACE_SEIZE, tracee, NULL, (void *)(unsigned long)PTRACE_SEIZE_DEVEL); ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL); repeat: waitid(P_PID, tracee, NULL, WSTOPPED); ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si); if (!si.si_code) { printf("tracer: SIG %d\n", si.si_signo); ptrace(PTRACE_CONT, tracee, NULL, (void *)(unsigned long)si.si_signo); goto repeat; } printf("tracer: stopped=%d signo=%d\n", si.si_signo != SIGTRAP, si.si_signo); if (si.si_signo != SIGTRAP) ptrace(PTRACE_LISTEN, tracee, NULL, NULL); else ptrace(PTRACE_CONT, tracee, NULL, NULL); goto repeat; } for (i = 0; i < 3; i++) { nanosleep(&ts1s, NULL); printf("mother: SIGSTOP\n"); kill(tracee, SIGSTOP); nanosleep(&ts1s, NULL); printf("mother: SIGCONT\n"); kill(tracee, SIGCONT); } nanosleep(&ts1s, NULL); kill(tracer, SIGKILL); kill(tracee, SIGKILL); return 0; } This is identical to the program to test TRAP_NOTIFY except that tracee is PTRACE_LISTEN'd instead of PTRACE_CONT'd when group stopped. This allows ptracer to monitor when group stop ends without running tracee. # ./test-listen tracer: stopped=0 signo=5 mother: SIGSTOP tracer: SIG 19 tracer: stopped=1 signo=19 mother: SIGCONT tracer: stopped=0 signo=5 tracer: SIG 18 mother: SIGSTOP tracer: SIG 19 tracer: stopped=1 signo=19 mother: SIGCONT tracer: stopped=0 signo=5 tracer: SIG 18 mother: SIGSTOP tracer: SIG 19 tracer: stopped=1 signo=19 mother: SIGCONT tracer: stopped=0 signo=5 tracer: SIG 18 -v2: Moved JOBCTL_LISTENING check in wait_task_stopped() into task_stopped_code() as suggested by Oleg. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com>	2011-06-16 21:41:54 +02:00
Tejun Heo	fb1d910c17	ptrace: implement TRAP_NOTIFY and use it for group stop events Currently there's no way for ptracer to find out whether group stop finished other than polling with INTERRUPT - GETSIGINFO - CONT sequence. This patch implements group stop notification for ptracer using STOP traps. When group stop state of a seized tracee changes, JOBCTL_TRAP_NOTIFY is set, which schedules a STOP trap which is sticky - it isn't cleared by other traps and at least one STOP trap will happen eventually. STOP trap is synchronization point for event notification and the tracer can determine the current group stop state by looking at the signal number portion of exit code (si_status from waitid(2) or si_code from PTRACE_GETSIGINFO). Notifications are generated both on start and end of group stops but, because group stop participation always happens before STOP trap, this doesn't cause an extra trap while tracee is participating in group stop. The symmetry will be useful later. Note that this notification works iff tracee is not trapped. Currently there is no way to be notified of group stop state changes while tracee is trapped. This will be addressed by a later patch. An example program follows. #define PTRACE_SEIZE 0x4206 #define PTRACE_INTERRUPT 0x4207 #define PTRACE_SEIZE_DEVEL 0x80000000 static const struct timespec ts1s = { .tv_sec = 1 }; int main(int argc, char **argv) { pid_t tracee, tracer; int i; tracee = fork(); if (!tracee) while (1) pause(); tracer = fork(); if (!tracer) { siginfo_t si; ptrace(PTRACE_SEIZE, tracee, NULL, (void *)(unsigned long)PTRACE_SEIZE_DEVEL); ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL); repeat: waitid(P_PID, tracee, NULL, WSTOPPED); ptrace(PTRACE_GETSIGINFO, tracee, NULL, &si); if (!si.si_code) { printf("tracer: SIG %d\n", si.si_signo); ptrace(PTRACE_CONT, tracee, NULL, (void *)(unsigned long)si.si_signo); goto repeat; } printf("tracer: stopped=%d signo=%d\n", si.si_signo != SIGTRAP, si.si_signo); ptrace(PTRACE_CONT, tracee, NULL, NULL); goto repeat; } for (i = 0; i < 3; i++) { nanosleep(&ts1s, NULL); printf("mother: SIGSTOP\n"); kill(tracee, SIGSTOP); nanosleep(&ts1s, NULL); printf("mother: SIGCONT\n"); kill(tracee, SIGCONT); } nanosleep(&ts1s, NULL); kill(tracer, SIGKILL); kill(tracee, SIGKILL); return 0; } In the above program, tracer keeps tracee running and gets notification of each group stop state change. # ./test-notify tracer: stopped=0 signo=5 mother: SIGSTOP tracer: SIG 19 tracer: stopped=1 signo=19 mother: SIGCONT tracer: stopped=0 signo=5 tracer: SIG 18 mother: SIGSTOP tracer: SIG 19 tracer: stopped=1 signo=19 mother: SIGCONT tracer: stopped=0 signo=5 tracer: SIG 18 mother: SIGSTOP tracer: SIG 19 tracer: stopped=1 signo=19 mother: SIGCONT tracer: stopped=0 signo=5 tracer: SIG 18 Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com>	2011-06-16 21:41:53 +02:00
Tejun Heo	fca26f260c	ptrace: implement PTRACE_INTERRUPT Currently, there's no way to trap a running ptracee short of sending a signal which has various side effects. This patch implements PTRACE_INTERRUPT which traps ptracee without any signal or job control related side effect. The implementation is almost trivial. It uses the group stop trap - SIGTRAP | PTRACE_EVENT_STOP << 8. A new trap flag JOBCTL_TRAP_INTERRUPT is added, which is set on PTRACE_INTERRUPT and cleared when any trap happens. As INTERRUPT should be useable regardless of the current state of tracee, task_is_traced() test in ptrace_check_attach() is skipped for INTERRUPT. PTRACE_INTERRUPT is available iff tracee is attached with PTRACE_SEIZE. Test program follows. #define PTRACE_SEIZE 0x4206 #define PTRACE_INTERRUPT 0x4207 #define PTRACE_SEIZE_DEVEL 0x80000000 static const struct timespec ts100ms = { .tv_nsec = 100000000 }; static const struct timespec ts1s = { .tv_sec = 1 }; static const struct timespec ts3s = { .tv_sec = 3 }; int main(int argc, char **argv) { pid_t tracee; tracee = fork(); if (tracee == 0) { nanosleep(&ts100ms, NULL); while (1) { printf("tracee: alive pid=%d\n", getpid()); nanosleep(&ts1s, NULL); } } if (argc > 1) kill(tracee, SIGSTOP); nanosleep(&ts100ms, NULL); ptrace(PTRACE_SEIZE, tracee, NULL, (void *)(unsigned long)PTRACE_SEIZE_DEVEL); if (argc > 1) { waitid(P_PID, tracee, NULL, WSTOPPED); ptrace(PTRACE_CONT, tracee, NULL, NULL); } nanosleep(&ts3s, NULL); printf("tracer: INTERRUPT and DETACH\n"); ptrace(PTRACE_INTERRUPT, tracee, NULL, NULL); waitid(P_PID, tracee, NULL, WSTOPPED); ptrace(PTRACE_DETACH, tracee, NULL, NULL); nanosleep(&ts3s, NULL); printf("tracer: exiting\n"); kill(tracee, SIGKILL); return 0; } When called without argument, tracee is seized from running state, interrupted and then detached back to running state. 
# ./test-interrupt tracee: alive pid=4546 tracee: alive pid=4546 tracee: alive pid=4546 tracer: INTERRUPT and DETACH tracee: alive pid=4546 tracee: alive pid=4546 tracee: alive pid=4546 tracer: exiting When called with argument, tracee is seized from stopped state, continued, interrupted and then detached back to stopped state. # ./test-interrupt 1 tracee: alive pid=4548 tracee: alive pid=4548 tracee: alive pid=4548 tracer: INTERRUPT and DETACH tracer: exiting Before PTRACE_INTERRUPT, once the tracee was running, there was no way to trap tracee and do PTRACE_DETACH without causing side effect. -v2: Updated to use task_set_jobctl_pending() so that it doesn't end up scheduling TRAP_STOP if child is dying which may make the child unkillable. Spotted by Oleg. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com>	2011-06-16 21:41:53 +02:00
Tejun Heo	3544d72a0e	ptrace: implement PTRACE_SEIZE PTRACE_ATTACH implicitly issues SIGSTOP on attach which has side effects on tracee signal and job control states. This patch implements a new ptrace request PTRACE_SEIZE which attaches a tracee without trapping it or affecting its signal and job control states. The usage is the same with PTRACE_ATTACH but it takes PTRACE_SEIZE_* flags in @data. Currently, the only defined flag is PTRACE_SEIZE_DEVEL which is a temporary flag to enable PTRACE_SEIZE. PTRACE_SEIZE will change ptrace behaviors outside of attach itself. The changes will be implemented gradually and the DEVEL flag is to prevent programs which expect full SEIZE behavior from using it before all the behavior modifications are complete while allowing unit testing. The flag will be removed once SEIZE behaviors are completely implemented. * PTRACE_SEIZE, unlike ATTACH, doesn't force tracee to trap. After attaching tracee continues to run unless a trap condition occurs. * PTRACE_SEIZE doesn't affect signal or group stop state. * If PTRACE_SEIZE'd, group stop uses PTRACE_EVENT_STOP trap which uses exit_code of (signr | PTRACE_EVENT_STOP << 8) where signr is one of the stopping signals if group stop is in effect or SIGTRAP otherwise, and returns usual trap siginfo on PTRACE_GETSIGINFO instead of NULL. Seizing sets PT_SEIZED in ->ptrace of the tracee. This flag will be used to determine whether new SEIZE behaviors should be enabled. Test program follows. 
#define PTRACE_SEIZE 0x4206 #define PTRACE_SEIZE_DEVEL 0x80000000 static const struct timespec ts100ms = { .tv_nsec = 100000000 }; static const struct timespec ts1s = { .tv_sec = 1 }; static const struct timespec ts3s = { .tv_sec = 3 }; int main(int argc, char **argv) { pid_t tracee; tracee = fork(); if (tracee == 0) { nanosleep(&ts100ms, NULL); while (1) { printf("tracee: alive\n"); nanosleep(&ts1s, NULL); } } if (argc > 1) kill(tracee, SIGSTOP); nanosleep(&ts100ms, NULL); ptrace(PTRACE_SEIZE, tracee, NULL, (void *)(unsigned long)PTRACE_SEIZE_DEVEL); if (argc > 1) { waitid(P_PID, tracee, NULL, WSTOPPED); ptrace(PTRACE_CONT, tracee, NULL, NULL); } nanosleep(&ts3s, NULL); printf("tracer: exiting\n"); return 0; } When the above program is called w/o argument, tracee is seized while running and remains running. When tracer exits, tracee continues to run and print out messages. # ./test-seize-simple tracee: alive tracee: alive tracee: alive tracer: exiting tracee: alive tracee: alive When called with an argument, tracee is seized from stopped state and continued, and returns to stopped state when tracer exits. # ./test-seize tracee: alive tracee: alive tracee: alive tracer: exiting # ps -el|grep test-seize 1 T 0 4720 1 0 80 0 - 941 signal ttyS0 00:00:00 test-seize -v2: SEIZE doesn't schedule TRAP_STOP and leaves tracee running as Jan suggested. -v3: PTRACE_EVENT_STOP traps now report group stop state by signr. If group stop is in effect the stop signal number is returned as part of exit_code; otherwise, SIGTRAP. This was suggested by Denys and Oleg. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Oleg Nesterov <oleg@redhat.com>	2011-06-16 21:41:53 +02:00
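The exit_code layout described in the PTRACE_SEIZE message above, (signr | PTRACE_EVENT_STOP << 8), can be unpacked as in the rough sketch below. The struct and function names are hypothetical, and the numeric value of PTRACE_EVENT_STOP is deliberately passed in by the caller rather than assumed, since the message does not state it.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical helper type, not part of any kernel or libc header. */
struct stop_info {
	int signr;          /* low byte: stop signal, or SIGTRAP */
	int event;          /* next byte: ptrace event number */
	bool group_stopped; /* true if a group stop is in effect */
};

/*
 * Decode an exit_code laid out as (signr | event << 8).
 * event_stop is whatever value PTRACE_EVENT_STOP has in the headers
 * being built against (an assumption supplied by the caller).
 */
static struct stop_info decode_stop_exit_code(int exit_code, int event_stop)
{
	struct stop_info info;

	info.signr = exit_code & 0xff;
	info.event = (exit_code >> 8) & 0xff;
	/* Per the message: signr is a stopping signal during group stop,
	 * SIGTRAP otherwise. */
	info.group_stopped = info.event == event_stop &&
			     info.signr != SIGTRAP;
	return info;
}
```

A tracer would feed this the exit code it obtained for a STOP trap and branch on group_stopped, much as the test programs compare si_signo against SIGTRAP.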
Tejun Heo	73ddff2bee	job control: introduce JOBCTL_TRAP_STOP and use it for group stop trap do_signal_stop() implemented both normal group stop and trap for group stop while ptraced. This approach has been enough but scheduled changes require trap mechanism which can be used in more generic manner and using group stop trap for generic trap site simplifies both userland visible interface and implementation. This patch adds a new jobctl flag - JOBCTL_TRAP_STOP. When set, it triggers a trap site, which behaves like group stop trap, in get_signal_to_deliver() after checking for pending signals. While ptraced, do_signal_stop() doesn't stop itself. It initiates group stop if requested and schedules JOBCTL_TRAP_STOP and returns. The caller - get_signal_to_deliver() - is responsible for checking whether TRAP_STOP is pending afterwards and handling it. ptrace_attach() is updated to use JOBCTL_TRAP_STOP instead of JOBCTL_STOP_PENDING and __ptrace_unlink() to clear all pending trap bits and TRAPPING so that TRAP_STOP and future trap bits don't linger after detach. While at it, add proper function comment to do_signal_stop() and make it return bool. -v2: __ptrace_unlink() updated to clear JOBCTL_TRAP_MASK and TRAPPING instead of JOBCTL_PENDING_MASK. This avoids accidentally clearing JOBCTL_STOP_CONSUME. Spotted by Oleg. -v3: do_signal_stop() updated to return %false without dropping siglock while ptraced and TRAP_STOP check moved inside for(;;) loop after group stop participation. This avoids unnecessary relocking and also will help avoiding unnecessary traps by consuming group stop before handling pending traps. -v4: Jobctl trap handling moved into a separate function - do_jobctl_trap(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com>	2011-06-16 21:41:52 +02:00
Thomas Gleixner	b5199515c2	clocksource: Make watchdog robust vs. interruption The clocksource watchdog code is interruptible and it has been observed that this can trigger false positives which disable the TSC. The reason is that an interrupt storm or a long running interrupt handler between the read of the watchdog source and the read of the TSC brings the two far enough apart that the delta is larger than the unstable threshold. Move both reads into a short interrupt disabled region to avoid that. Reported-and-tested-by: Vernon Mauery <vernux@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org	2011-06-16 19:30:53 +02:00
Ingo Molnar	b4f9f2b64a	Merge commit 'v3.0-rc3' into perf/core Merge reason: add the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-16 13:23:22 +02:00
Paul E. McKenney	a46e0899ee	rcu: use softirq instead of kthreads except when RCU_BOOST=y This patch #ifdefs RCU kthreads out of the kernel unless RCU_BOOST=y, thus eliminating context-switch overhead if RCU priority boosting has not been configured. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-06-15 23:07:21 -07:00
Linus Torvalds	a1b6ae8ed0	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Check if lowest_mask is initialized in find_lowest_rq() sched: Fix need_resched() when checking peempt	2011-06-15 21:45:18 -07:00
Josh Triplett	d2c3225879	gcov: disable CONFIG_CONSTRUCTORS when not needed by CONFIG_GCOV_KERNEL CONFIG_CONSTRUCTORS controls support for running constructor functions at kernel init time. According to commit `b99b87f70c` ("kernel: constructor support"), gcov (CONFIG_GCOV_KERNEL) needs this. However, CONFIG_CONSTRUCTORS currently defaults to y, with no option to disable it, and CONFIG_GCOV_KERNEL depends on it. Instead, default it to n and have CONFIG_GCOV_KERNEL select it, so that the normal case of CONFIG_GCOV_KERNEL=n will result in CONFIG_CONSTRUCTORS=n. Observed in the short list of =y values in a minimal kernel configuration. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Acked-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-15 20:04:01 -07:00
KAMEZAWA Hiroyuki	733eda7ac3	memcg: clear mm->owner when last possible owner leaves The following crash was reported: > Call Trace: > [<ffffffff81139792>] mem_cgroup_from_task+0x15/0x17 > [<ffffffff8113a75a>] __mem_cgroup_try_charge+0x148/0x4b4 > [<ffffffff810493f3>] ? need_resched+0x23/0x2d > [<ffffffff814cbf43>] ? preempt_schedule+0x46/0x4f > [<ffffffff8113afe8>] mem_cgroup_charge_common+0x9a/0xce > [<ffffffff8113b6d1>] mem_cgroup_newpage_charge+0x5d/0x5f > [<ffffffff81134024>] khugepaged+0x5da/0xfaf > [<ffffffff81078ea0>] ? __init_waitqueue_head+0x4b/0x4b > [<ffffffff81133a4a>] ? add_mm_counter.constprop.5+0x13/0x13 > [<ffffffff81078625>] kthread+0xa8/0xb0 > [<ffffffff814d13e8>] ? sub_preempt_count+0xa1/0xb4 > [<ffffffff814d5664>] kernel_thread_helper+0x4/0x10 > [<ffffffff814ce858>] ? retint_restore_args+0x13/0x13 > [<ffffffff8107857d>] ? __init_kthread_worker+0x5a/0x5a What happens is that khugepaged tries to charge a huge page against an mm whose last possible owner has already exited, and the memory controller crashes when the stale mm->owner is used to look up the cgroup to charge. mm->owner has never been set to NULL with the last owner going away, but nobody cared until khugepaged came along. Even then it wasn't a problem because the final mmput() on an mm was forced to acquire and release mmap_sem in write-mode, preventing an exiting owner to go away while the mmap_sem was held, and until "692e0b3 mm: thp: optimize memcg charge in khugepaged", the memory cgroup charge was protected by mmap_sem in read-mode. Instead of going back to relying on the mmap_sem to enforce lifetime of a task, this patch ensures that mm->owner is properly set to NULL when the last possible owner is exiting, which the memory controller can handle just fine. 
[akpm@linux-foundation.org: tweak comments] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: Dave Jones <davej@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-15 20:04:01 -07:00
Steven Rostedt	0da938c449	sched: Check if lowest_mask is initialized in find_lowest_rq() On system boot up, the lowest_mask is initialized with an early_initcall(). But RT tasks may wake up on other early_initcall() callers before the lowest_mask is initialized, causing a system crash. Commit "d72bce0e67 rcu: Cure load woes" was the first commit to wake up RT tasks in early init. Before this commit this bug should not happen. Reported-by: Andrew Theurer <habanero@linux.vnet.ibm.com> Tested-by: Andrew Theurer <habanero@linux.vnet.ibm.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20110614223657.824872966@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-15 11:44:48 +02:00
Hillf Danton	8dd0de8be3	sched: Fix need_resched() when checking peempt The RT preempt check tests the wrong task if NEED_RESCHED is set. It currently checks the local CPU task. It is supposed to check the task that is running on the runqueue we are about to wake another task on. Signed-off-by: Hillf Danton <dhillf@gmail.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20110614223657.450239027@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-15 09:50:32 +02:00
Masami Hiramatsu	1fd8df2c39	tracing/kprobes: Fix kprobe-tracer to support stack trace Fix to support kernel stack trace correctly on kprobe-tracer. Since the execution path of kprobe-based dynamic events is different from other tracepoint-based events, normal ftrace_trace_stack() doesn't work correctly. To fix that, this introduces ftrace_trace_stack_regs() which traces stack via pt_regs instead of current stack register. e.g. # echo p schedule+4 > /sys/kernel/debug/tracing/kprobe_events # echo 1 > /sys/kernel/debug/tracing/options/stacktrace # echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable # head -n 20 /sys/kernel/debug/tracing/trace bash-2968 [000] 10297.050245: p_schedule_4: (schedule+0x4/0x4ca) bash-2968 [000] 10297.050247: <stack trace> => schedule_timeout => n_tty_read => tty_read => vfs_read => sys_read => system_call_fastpath kworker/0:1-2940 [000] 10297.050265: p_schedule_4: (schedule+0x4/0x4ca) kworker/0:1-2940 [000] 10297.050266: <stack trace> => worker_thread => kthread => kernel_thread_helper sshd-1132 [000] 10297.050365: p_schedule_4: (schedule+0x4/0x4ca) sshd-1132 [000] 10297.050365: <stack trace> => sysret_careful Note: Even with this fix, the first entry will be skipped if the probe is put on the function entry area before the frame pointer is set up (usually, that is 4 bytes (push %bp; mov %sp %bp) on x86), because stack unwinder depends on the frame pointer. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: yrl.pp-manager.tt@hitachi.com Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Namhyung Kim <namhyung@gmail.com> Link: http://lkml.kernel.org/r/20110608070934.17777.17116.stgit@fedora15 Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:53 -04:00
Masami Hiramatsu	c624d33f61	stack_trace: Add weak save_stack_trace_regs() Add weak symbol of save_stack_trace_regs() as same as save_stack_trace_tsk() since that is not implemented except x86 yet. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: yrl.pp-manager.tt@hitachi.com Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Namhyung Kim <namhyung@gmail.com> Link: http://lkml.kernel.org/r/20110608070927.17777.37895.stgit@fedora15 Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:52 -04:00
Vaibhav Nagarnaik	d7ec4bfed6	ring-buffer: Set __GFP_NORETRY flag for ring buffer allocating process The tracing ring buffer is allocated from kernel memory. While allocating a large chunk of memory, OOM might happen which destabilizes the system. Thus random processes might get killed during the allocation. This patch adds __GFP_NORETRY flag to the ring buffer allocation calls to make it fail more gracefully if the system will not be able to complete the allocation request. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Michael Rubin <mrubin@google.com> Cc: David Sharp <dhsharp@google.com> Link: http://lkml.kernel.org/r/1307491302-9236-1-git-send-email-vnagarnaik@google.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:51 -04:00
Peter Huewe	22fe9b54d8	tracing: Convert to kstrtoul_from_user This patch replaces the code for getting an unsigned long from a userspace buffer by a simple call to kstrtoul_from_user. This makes it easier to read and less error prone. Signed-off-by: Peter Huewe <peterhuewe@gmx.de> Link: http://lkml.kernel.org/r/1307476707-14762-1-git-send-email-peterhuewe@gmx.de Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:50 -04:00
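To illustrate why a single strict parsing helper is less error prone than open-coded conversion, here is a userspace analogue built on strtoul(). It is only a sketch: parse_ulong_strict is a hypothetical name, it is not the kernel's kstrtoul_from_user, and its acceptance rules (no sign, no leading whitespace, one optional trailing newline) are assumptions chosen for the example.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of a strict unsigned long parser.
 * Returns 0 and stores the value in *res on success, -EINVAL for
 * malformed input, -ERANGE on overflow.
 */
static int parse_ulong_strict(const char *s, int base, unsigned long *res)
{
	char *end;
	unsigned long val;

	/* strtoul() tolerates leading whitespace and signs; reject both,
	 * along with the empty string. */
	if (!isalnum((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &end, base);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s)
		return -EINVAL;	/* no digits consumed */

	/* Allow one trailing newline, as produced by "echo 42 > file". */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;	/* trailing garbage */

	*res = val;
	return 0;
}
```

Every caller then gets the same checks for empty input, trailing garbage and overflow, instead of repeating them by hand at each call site.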
Jiri Olsa	749230b06a	tracing, function_graph: Add context-info support for function_graph tracer The function_graph tracer does not follow global context-info option. Adding TRACE_ITER_CONTEXT_INFO trace_flags check to enable it. With following commands: # echo function_graph > ./current_tracer # echo 0 > options/context-info # cat trace This is what it looked like before: # tracer: function_graph # # TIME CPU DURATION FUNCTION CALLS # | | | | | | | | 1) 0.079 us | } /* __vma_link_rb */ 1) 0.056 us | copy_page_range(); 1) | security_vm_enough_memory() { ... This is what it looks like now: # tracer: function_graph # } /* update_ts_time_stats */ timekeeping_max_deferment(); ... Signed-off-by: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1307113131-10045-6-git-send-email-jolsa@redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:49 -04:00
Jiri Olsa	199abfab40	tracing, function_graph: Remove lock-depth from latency trace The lock_depth was removed in commit `e6e1e25` tracing: Remove lock_depth from event entry Removing the lock_depth info from function_graph latency header. With following commands: # echo function_graph > ./current_tracer # echo 1 > options/latency-format # cat trace This is what it looked like before: # tracer: function_graph # # function_graph latency trace v1.1.5 on 3.0.0-rc1-tip+ # -------------------------------------------------------------------- # latency: 0 us, #59756/311298, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) # ----------------- # | task: -0 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> lock-depth # |||| / # CPU||||| DURATION FUNCTION CALLS # | ||||| | | | | | | 0) .... 0.068 us | } /* __rcu_read_unlock */ ... This is what it looks like now: # tracer: function_graph # # function_graph latency trace v1.1.5 on 3.0.0-rc1-tip+ # -------------------------------------------------------------------- # latency: 0 us, #59747/1744610, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) # ----------------- # | task: -0 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / # CPU|||| DURATION FUNCTION CALLS # | |||| | | | | | | 0) ..s. 1.641 us | } /* __rcu_process_callbacks */ ... Signed-off-by: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1307113131-10045-5-git-send-email-jolsa@redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:49 -04:00
Jiri Olsa	f56e7f8efb	tracing, function: Fix trace header to follow context-info option The header display of function tracer does not follow the context-info option, so field names are displayed even if this option is off. Added check for TRACE_ITER_CONTEXT_INFO trace_flags. With following commands: # echo function > ./current_tracer # echo 0 > options/context-info # cat trace This is what it looked like before: # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | add_preempt_count <-schedule rcu_note_context_switch <-schedule ... This is what it looks like now: # tracer: function # _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel ... Signed-off-by: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1307113131-10045-4-git-send-email-jolsa@redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:48 -04:00
Jiri Olsa	ffeb80fc30	tracing, function_graph: Merge overhead and duration display functions Functions print_graph_overhead() and print_graph_duration() displays data for one field - DURATION. I merged them into single function print_graph_duration(), and added a way to display the empty parts of the field. This way the print_graph_irq() function can use this column to display the IRQ signs if needed and the DURATION field details stays inside the print_graph_duration() function. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1307113131-10045-3-git-send-email-jolsa@redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:47 -04:00
Jiri Olsa	321e68b095	tracing, function_graph: Remove dependency of abstime and duration fields on latency The display of absolute time and duration fields is based on the latency field. This was added during the irqsoff/wakeup tracers graph support changes. It's causing confusion in what fields will be displayed for the function_graph tracer itself. So I'm removing this dependency, and adding absolute time and duration fields to the preemptirqsoff preemptoff irqsoff wakeup tracers. With following commands: # echo function_graph > ./current_tracer # cat trace This is what it looked like before: # tracer: function_graph # # TIME CPU DURATION FUNCTION CALLS # | | | | | | | | 0) 0.068 us | } /* page_add_file_rmap */ 0) | _raw_spin_unlock() { ... This is what it looks like now: # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 0) 0.068 us | } /* add_preempt_count */ 0) 0.993 us | } /* vfsmount_lock_local_lock */ ... For preemptirqsoff preemptoff irqsoff wakeup tracers, this is what it looked like before: SNIP # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> lock-depth # |||| / # CPU TASK/PID ||||| DURATION FUNCTION CALLS # | | | ||||| | | | | | | 1) <idle>-0 | d..1 0.000 us | acpi_idle_enter_simple(); ... This is what it looks like now: SNIP # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / # TIME CPU TASK/PID |||| DURATION FUNCTION CALLS # | | | | |||| | | | | | | 19.847735 | 1) <idle>-0 | d..1 0.000 us | acpi_idle_enter_simple(); ... Signed-off-by: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1307113131-10045-2-git-send-email-jolsa@redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:47 -04:00
Paul McQuade	84c15027a7	async: Fixed an include coding style issue Added <linux/atomic.h>,<linux/ktime.h> and Removed <asm/atomic.h>. Added KERN_DEBUG to printk() functions. Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Paul McQuade <tungstentide@gmail.com> Link: http://lkml.kernel.org/r/4DE596B4.7030904@gmail.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:46 -04:00
Paul McQuade	bd38c0e6f9	ftrace: Fixed an include coding style issue Removed <asm/ftrace.h> because <linux/ftrace.h> was already declared. Braces of struct's coding style fixed. Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Paul McQuade <tungstentide@gmail.com> Link: http://lkml.kernel.org/r/4DE59711.3090900@gmail.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:45 -04:00
Steven Rostedt	cf30cf67d6	tracing: Add disable_on_free option Add a trace option to disable tracing on free. When this option is set, a write into the free_buffer file will not only shrink the ring buffer down to zero, but it will also disable tracing. Cc: Vaibhav Nagarnaik <vnagarnaik@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:45 -04:00
Vaibhav Nagarnaik	4f271a2a60	tracing: Add a proc file to stop tracing and free buffer The proc file entry buffer_size_kb is used to set the size of tracing buffer. The memory to expand the buffer size is kernel memory. Consider a use case where tracing is handled by a user space utility, which acts as a gate keeper for tracing requests. In an OOM condition, tracing is considered a low priority task and if the utility gets killed the ring buffer memory cannot be released back to the kernel. This patch adds a proc file called "free_buffer" whose purpose is to stop tracing and free up the ring buffer when it is closed. The user space process can then set the desired size in buffer_size_kb file and open the fd to the "free_buffer" file. Under OOM condition, if the process gets killed, the kernel closes the file descriptor. The release handler stops the tracing and releases the kernel memory automatically. Cc: Ingo Molnar <mingo@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Michael Rubin <mrubin@google.com> Cc: David Sharp <dhsharp@google.com> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Link: http://lkml.kernel.org/r/1308012717-11148-1-git-send-email-vnagarnaik@google.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:48:37 -04:00
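The usage pattern described in the free_buffer message above (set buffer_size_kb, then keep an fd to free_buffer open so the kernel's release handler runs if the process dies) might look roughly like this from user space. This is a sketch under assumptions: the helper names are hypothetical, and the tracing directory is passed in by the caller because the message does not give a full path.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a short string to an existing file; 0 on success, -1 on error. */
static int write_text(const char *path, const char *text)
{
	ssize_t n;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	n = write(fd, text, strlen(text));
	close(fd);
	return n == (ssize_t)strlen(text) ? 0 : -1;
}

/*
 * Hypothetical helper: set the ring buffer size, then open free_buffer
 * and return the fd. The caller keeps the fd open while tracing; when it
 * is closed (explicitly, or by the kernel if the process is killed), the
 * release handler described above stops tracing and frees the buffer.
 * tracing_dir is an assumption supplied by the caller.
 */
static int arm_free_buffer(const char *tracing_dir, unsigned long size_kb)
{
	char path[4096];
	char text[32];

	snprintf(path, sizeof(path), "%s/buffer_size_kb", tracing_dir);
	snprintf(text, sizeof(text), "%lu\n", size_kb);
	if (write_text(path, text) < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/free_buffer", tracing_dir);
	return open(path, O_WRONLY);
}
```

The point of the design is that cleanup is tied to the lifetime of the file descriptor, so it needs no cooperation from a process that may already have been killed by the OOM killer.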
Randy Dunlap	ada9c93312	signal.c: fix kernel-doc notation Fix kernel-doc warnings in signal.c: Warning(kernel/signal.c:2374): No description found for parameter 'nset' Warning(kernel/signal.c:2374): Excess function parameter 'set' description in 'sys_rt_sigprocmask' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-06-14 19:12:17 -07:00
Vaibhav Nagarnaik	7ea5906405	tracing: Use NUMA allocation for per-cpu ring buffer pages The tracing ring buffer is a group of per-cpu ring buffers where allocation and logging is done on a per-cpu basis. The events that are generated on a particular CPU are logged in the corresponding buffer. This is to provide wait-free writes between CPUs and good NUMA node locality while accessing the ring buffer. However, the allocation routines consider NUMA locality only for buffer page metadata and not for the actual buffer page. This causes the pages to be allocated on the NUMA node local to the CPU where the allocation routine is running at the time. This patch fixes the problem by using a NUMA node specific allocation routine so that the pages are allocated from a NUMA node local to the logging CPU. I tested with the getuid_microbench from autotest. It is a simple binary that calls getuid() in a loop and measures the average time for the syscall to complete. The following command was used to test: $ getuid_microbench 1000000 Compared the numbers found on kernel with and without this patch and found that logging latency decreases by 30-50 ns/call. tracing with non-NUMA allocation - 569 ns/call tracing with NUMA allocation - 512 ns/call Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Rubin <mrubin@google.com> Cc: David Sharp <dhsharp@google.com> Link: http://lkml.kernel.org/r/1304470602-20366-1-git-send-email-vnagarnaik@google.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 22:04:39 -04:00
Vaibhav Nagarnaik	e7e2ee89a9	tracing: Schedule a delayed work to call wakeup() In using syscall tracing by concurrent processes, the wakeup() that is called in the event commit function causes contention on the spin lock of the waitqueue. I enabled sys_enter_getuid and sys_exit_getuid tracepoints, and by running getuid_microbench from autotest in parallel I found that the contention causes exponential latency increase in the tracing path. The autotest binary getuid_microbench calls getuid() in a tight loop for the given number of iterations and measures the average time required to complete a single invocation of syscall. The patch schedules a delayed work after 2 ms once an event commit calls to wake up the trace wait_queue. This removes the delay caused by contention on spin lock in wakeup() and amortizes the wakeup() calls scheduled over the 2 ms period. In the following example, the script enables the sys_enter_getuid and sys_exit_getuid tracepoints and runs the getuid_microbench in parallel with the given number of processes. The output clearly shows the latency increase caused by contentions. 
$ ~/getuid.sh 1 1000000 calls in 0.720974253 s (720.974253 ns/call) $ ~/getuid.sh 2 1000000 calls in 1.166457554 s (1166.457554 ns/call) 1000000 calls in 1.168933765 s (1168.933765 ns/call) $ ~/getuid.sh 3 1000000 calls in 1.783827516 s (1783.827516 ns/call) 1000000 calls in 1.795553270 s (1795.553270 ns/call) 1000000 calls in 1.796493376 s (1796.493376 ns/call) $ ~/getuid.sh 4 1000000 calls in 4.483041796 s (4483.041796 ns/call) 1000000 calls in 4.484165388 s (4484.165388 ns/call) 1000000 calls in 4.484850762 s (4484.850762 ns/call) 1000000 calls in 4.485643576 s (4485.643576 ns/call) $ ~/getuid.sh 5 1000000 calls in 6.497521653 s (6497.521653 ns/call) 1000000 calls in 6.502000236 s (6502.000236 ns/call) 1000000 calls in 6.501709115 s (6501.709115 ns/call) 1000000 calls in 6.502124100 s (6502.124100 ns/call) 1000000 calls in 6.502936358 s (6502.936358 ns/call) After the patch, the latencies scale better. 1000000 calls in 0.728720455 s (728.720455 ns/call) 1000000 calls in 0.842782857 s (842.782857 ns/call) 1000000 calls in 0.883803135 s (883.803135 ns/call) 1000000 calls in 0.902077764 s (902.077764 ns/call) 1000000 calls in 0.902838202 s (902.838202 ns/call) 1000000 calls in 0.908896885 s (908.896885 ns/call) 1000000 calls in 0.932523515 s (932.523515 ns/call) 1000000 calls in 0.958009672 s (958.009672 ns/call) 1000000 calls in 0.986188020 s (986.188020 ns/call) 1000000 calls in 0.989771102 s (989.771102 ns/call) 1000000 calls in 0.933518391 s (933.518391 ns/call) 1000000 calls in 0.958897947 s (958.897947 ns/call) 1000000 calls in 1.031038897 s (1031.038897 ns/call) 1000000 calls in 1.089516025 s (1089.516025 ns/call) 1000000 calls in 1.141998347 s (1141.998347 ns/call) Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Rubin <mrubin@google.com> Cc: David Sharp <dhsharp@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: 
http://lkml.kernel.org/r/1305059241-7629-1-git-send-email-vnagarnaik@google.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-14 21:59:41 -04:00
Shaohua Li	09223371de	rcu: Use softirq to address performance regression Commit a26ac2455ffcf3(rcu: move TREE_RCU from softirq to kthread) introduced performance regression. In an AIM7 test, this commit degraded performance by about 40%. The commit runs rcu callbacks in a kthread instead of softirq. We observed high rate of context switch which is caused by this. Our test system has 64 CPUs and HZ is 1000, so we saw more than 64k context switch per second which is caused by RCU's per-CPU kthread. A trace showed that most of the time the RCU per-CPU kthread doesn't actually handle any callbacks, but instead just does a very small amount of work handling grace periods. This means that RCU's per-CPU kthreads are making the scheduler do quite a bit of work in order to allow a very small amount of RCU-related processing to be done. Alex Shi's analysis determined that this slowdown is due to lock contention within the scheduler. Unfortunately, as Peter Zijlstra points out, the scheduler's real-time semantics require global action, which means that this contention is inherent in real-time scheduling. (Yes, perhaps someone will come up with a workaround -- otherwise, -rt is not going to do well on large SMP systems -- but this patch will work around this issue in the meantime. And "the meantime" might well be forever.) This patch therefore re-introduces softirq processing to RCU, but only for core RCU work. RCU callbacks are still executed in kthread context, so that only a small amount of RCU work runs in softirq context in the common case. This should minimize ksoftirqd execution, allowing us to skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Tested-by: "Alex,Shi" <alex.shi@intel.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-06-14 15:25:39 -07:00
Paul E. McKenney	9a43273690	rcu: Simplify curing of load woes Make the functions creating the kthreads wake them up. Leverage the fact that the per-node and boost kthreads can run anywhere, thus dispensing with the need to wake them up once the incoming CPU has gone fully online. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Daniel J Blueman <daniel.blueman@gmail.com>	2011-06-14 15:25:15 -07:00
Linus Torvalds	c78a9b9b8e	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ftrace: Revert `8ab2b7efd` ftrace: Remove unnecessary disabling of irqs kprobes/trace: Fix kprobe selftest for gcc 4.6 ftrace: Fix possible undefined return code oprofile, dcookies: Fix possible circular locking dependency oprofile: Fix locking dependency in sync_start() oprofile: Free potentially owned tasks in case of errors oprofile, x86: Add comments to IBS LVT offset initialization	2011-06-13 10:45:49 -07:00
Frederic Weisbecker	bdd4e85dc3	sched: Isolate preempt counting in its own config option Create a new CONFIG_PREEMPT_COUNT that handles the inc/dec of preempt count offset independently. So that the offset can be updated by preempt_disable() and preempt_enable() even without the need for CONFIG_PREEMPT being set. This prepares to make CONFIG_DEBUG_SPINLOCK_SLEEP working with !CONFIG_PREEMPT where it currently doesn't detect code that sleeps inside explicit preemption disabled sections. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>	2011-06-10 15:15:40 +02:00
Joe Perches	28f65c11f2	treewide: Convert uses of struct resource to resource_size(ptr) Several fixes as well where the +1 was missing. Done via coccinelle scripts like: @@ struct resource *ptr; @@ - ptr->end - ptr->start + 1 + resource_size(ptr) and some grep and typing. Mostly uncompiled, no cross-compilers. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-06-10 14:55:36 +02:00
Jesper Juhl	13863a66c9	genirq: Prevent potential NULL dereference in irq_set_irq_wake() In kernel/irq/manage.c::irq_set_irq_wake() we call irq_get_desc_buslock() which may return NULL, but the code dereferences the result unconditionally. irq_set_irq_wake() has lots of callers - I checked a few and I couldn't find anything that guarantees that they won't call it with some input that will cause irq_get_desc_buslock() to return NULL, so I think it's a good thing to test and -EINVAL was the most sane error code in this situation that I could think of. Not all callers test the return value of irq_set_irq_wake(), but those that do take != 0 to mean error as far as I can see, so they should be fine. I guess those that don't test actually should, but that's a different issue. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1106092300360.17868@swampdragon.chaosbits.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-06-10 10:53:42 +02:00
Steven Rostedt	db5e7ecc4a	tracing: Fix regression in printk_formats file The fix to fix the printk_formats of modules broke the printk_formats of trace_printks in the kernel. The update of what to show via the seq_file was only updated if the passed in fmt was NULL, which happens only on the first iteration. The result was showing the first format every time instead of iterating through the available formats. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-09 08:42:15 -04:00
Frederic Weisbecker	76369139ce	perf: Split up buffer handling from core code And create the internal perf events header. v2: Keep an internal inlined perf_output_copy() Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Borislav Petkov <bp@alien8.de> Cc: Stephane Eranian <eranian@google.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1305827704-5607-1-git-send-email-fweisbec@gmail.com [ v3: use clearer 'ring_buffer' and 'rb' naming ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-09 12:57:54 +02:00
eparis@redhat	2ce9738bac	cgroupfs: use init_cred when populating new cgroupfs mount We recently found that in some configurations SELinux was blocking the ability for cgroupfs to be mounted. The reason for this is because cgroupfs creates files and directories during the get_sb() call and also uses lookup_one_len() during that same get_sb() call. This is a problem since the security subsystem cannot initialize the superblock and the inodes in that filesystem until after the get_sb() call returns. Thus we leave the inodes in an uninitialized state during get_sb(). For the vast majority of filesystems this is not an issue, but since cgroupfs uses lookup_one_len() it does search permission checks on the directories in the path it walks. Since the inode security state is not set up SELinux does these checks as if the inodes were 'unlabeled.' Many 'normal' userspace processes do not have permission to interact with unlabeled inodes. The solution presented here is to do the permission checks of path walk and inode creation as the kernel rather than as the task that called mount. Since the kernel has permission to read/write/create unlabeled inodes the get_sb() call will complete successfully and the SELinux code will be able to initialize the superblock and those inodes created during the get_sb() call. This appears to be the same solution used by other filesystems such as devtmpfs to solve the same issue and should thus have no negative impact on other LSMs which currently work. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: James Morris <jmorris@namei.org>	2011-06-09 11:59:53 +10:00
Linus Torvalds	33726bf214	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix comments in include/linux/perf_event.h perf: Comment /proc/sys/kernel/perf_event_paranoid to be part of user ABI perf python: Fix argument name list of read_on_cpu() perf evlist: Don't die if sample_{id_all\|type} is invalid perf python: Use exception to propagate errors perf evlist: Remove dependency on debug routines perf, cgroups: Fix up for new API	2011-06-08 08:36:15 -07:00
Linus Torvalds	cb0a02ecf9	Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Ensure we locate the passed IRQ in irq_alloc_descs() genirq: Fix descriptor init on non-sparse IRQs irq: Handle spurios irq detection for threaded irqs genirq: Print threaded handler in spurious debug output	2011-06-07 19:21:11 -07:00
Linus Torvalds	6715a52a58	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix/clarify set_task_cpu() locking rules lockdep: Fix lock_is_held() on recursion sched: Fix schedstat.nr_wakeups_migrate sched: Fix cross-cpu clock sync on remote wakeups	2011-06-07 19:20:28 -07:00
Frederic Weisbecker	2da8c8bc44	sched: Remove pointless in_atomic() definition check It's really supposed to be defined here. If it's not then we actually want the build to crash so that we know it, and not keep it silent. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>	2011-06-07 22:53:39 +02:00
Steven Rostedt	a4f18ed11a	ftrace: Revert `8ab2b7efd` ftrace: Remove unnecessary disabling of irqs Revert the commit that removed the disabling of interrupts around the initial modifying of mcount callers to nops, and update the comment. The original comment was outdated and stated that the interrupts were being disabled to prevent kstop machine, which was required with the old ftrace daemon, but was no longer the case. What the comment failed to mention was that interrupts needed to be disabled to keep interrupts from preempting the modifying of the code and then executing the code that was partially modified. Revert the commit and update the comment. Reported-by: Richard W.M. Jones <rjones@redhat.com> Tested-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-07 14:49:19 -04:00
Steven Rostedt	265a5b7ee3	kprobes/trace: Fix kprobe selftest for gcc 4.6 With gcc 4.6, the self test kprobe function: kprobe_trace_selftest_target() is optimized such that kallsyms does not list it. The kprobes test uses this function to insert a probe and test it. But it will fail the test if the function is not listed in kallsyms. Adding a __used annotation keeps the symbol in the kallsyms table. Suggested-by: David Daney <ddaney@caviumnetworks.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-07 14:47:36 -04:00
Peter Zijlstra	b58f6b0dd3	perf, core: Fix initial task_ctx/event installation A lost Quilt refresh of `2c29ef0fef` (perf: Simplify and fix __perf_install_in_context()) is causing grief and lockups, reported by Jiri Olsa. When installing an event in a task context, there's a number of issues: - there might not be an existing task context, in which case we should install the now current context; - there might already be a context, not the current one, in which case we should de-schedule the old and install the new; these cases were dealt with in the lost refresh, however there is one further case that was found in testing: - there might already be a context, the current one, in which case we should still de-schedule, and should take care to re-install it (note that task_ctx_sched_out() clears cpuctx->task_ctx). Reported-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1307399008.2497.971.camel@laptop Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-07 13:02:41 +02:00
Peter Zijlstra	0b5e1c5255	printk: Release console_sem after logbuf_lock Release console_sem after unlocking the logbuf_lock so that we don't generate wakeups while holding logbuf_lock. This avoids some lock inversion troubles once we remove the lockdep_off bits between logbuf_lock and rq->lock (prints while holding rq->lock vs doing wakeups while holding logbuf_lock). There's of course still an actual deadlock where the printk()s under rq->lock will issue a wakeup from the up() call, but lockdep won't warn about that since semaphores are not tracked. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-j8swthl12u73h4znbvitljzd@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-07 12:50:02 +02:00
Peter Zijlstra	6c6c54e180	sched: Fix/clarify set_task_cpu() locking rules Sergey reported a CONFIG_PROVE_RCU warning in push_rt_task where set_task_cpu() was called with both relevant rq->locks held, which should be sufficient for running tasks since holding its rq->lock will serialize against sched_move_task(). Update the comments and fix the task_group() lockdep test. Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1307115427.2353.3456.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-07 12:26:40 +02:00
Peter Zijlstra	f2513cde93	lockdep: Fix lock_is_held() on recursion The main lock_is_held() user is lockdep_assert_held(), avoid false assertions in lockdep_off() sections by unconditionally reporting the lock is taken. [ the reason this is important is a lockdep_assert_held() in ttwu() which triggers a warning under lockdep_off() as in printk() which can trigger another wakeup and lock up due to spinlock recursion, as reported and heroically debugged by Arne Jansen ] Reported-and-tested-by: Arne Jansen <lists@die-jansens.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/1307398759.2497.966.camel@laptop Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-07 12:25:50 +02:00
GuoWen Li	0aff1c0cef	ftrace: Fix possible undefined return code kernel/trace/ftrace.c: In function 'ftrace_regex_write.clone.15': kernel/trace/ftrace.c:2743:6: warning: 'ret' may be used uninitialized in this function Signed-off-by: GuoWen Li <guowen.li.linux@gmail.com> Link: http://lkml.kernel.org/r/201106011918.47939.guowen.li.linux@gmail.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-06-06 22:34:25 -04:00
Tejun Heo	dd1d677269	signal: remove three noop tracehooks Remove the following three noop tracehooks in signals.c. * tracehook_force_sigpending() * tracehook_get_signal() * tracehook_finish_jctl() The code area is about to be updated and these hooks don't do anything other than obfuscating the logic. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-04 18:17:11 +02:00
Tejun Heo	62c124ff3b	ptrace: use bit_waitqueue for TRAPPING instead of wait_chldexit ptracer->signal->wait_chldexit was used to wait for TRAPPING; however, ->wait_chldexit was already complicated with waker-side filtering without adding TRAPPING wait on top of it. Also, it unnecessarily made TRAPPING clearing depend on the current ptrace relationship - if the ptracee is detached, wakeup is lost. There is no reason to use signal->wait_chldexit here. We're just waiting for JOBCTL_TRAPPING bit to clear and given the relatively infrequent use of ptrace, bit_waitqueue can serve it perfectly. This patch makes JOBCTL_TRAPPING wait use bit_waitqueue instead of signal->wait_chldexit. -v2: Use JOBCTL_*_BIT macros instead of ilog2() as suggested by Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-04 18:17:11 +02:00
Tejun Heo	7dd3db54e7	job control: introduce task_set_jobctl_pending() task->jobctl currently hosts JOBCTL_STOP_PENDING and will host TRAP pending bits too. Setting pending conditions on a dying task may make the task unkillable. Currently, each setting site is responsible for checking for the condition but with to-be-added job control traps this becomes too fragile. This patch adds task_set_jobctl_pending() which should be used when setting task->jobctl bits to schedule a stop or trap. The function performs the following to ease setting pending bits. * Sanity checks. * If a fatal signal is pending or PF_EXITING is set, no bit is set. * STOP_SIGMASK is automatically cleared if a new value is being set. do_signal_stop() and ptrace_attach() are updated to use task_set_jobctl_pending() instead of setting STOP_PENDING explicitly. The surrounding structures around setting are changed to fit task_set_jobctl_pending() better but there should be no userland visible behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-04 18:17:11 +02:00
Tejun Heo	6dfca32984	job control: make task_clear_jobctl_pending() clear TRAPPING automatically JOBCTL_TRAPPING indicates that ptracer is waiting for tracee to (re)transit into TRACED. task_clear_jobctl_pending() must be called when either tracee enters TRACED or the transition is cancelled for some reason. The former is achieved by explicitly calling task_clear_jobctl_pending() in ptrace_stop() and the latter by calling it at the end of do_signal_stop(). Calling task_clear_jobctl_trapping() at the end of do_signal_stop() limits the scope TRAPPING can be used and is fragile in that seemingly unrelated changes to tracee's control flow can lead to stuck TRAPPING. We already have task_clear_jobctl_pending() calls on those cancelling events to clear JOBCTL_STOP_PENDING. Cancellations can be handled by making those call sites use JOBCTL_PENDING_MASK instead and updating task_clear_jobctl_pending() such that task_clear_jobctl_trapping() is called automatically if no stop/trap is pending. This patch makes the above changes and removes the fallback task_clear_jobctl_trapping() call from do_signal_stop(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-04 18:17:11 +02:00
Tejun Heo	3759a0d94c	job control: introduce JOBCTL_PENDING_MASK and task_clear_jobctl_pending() This patch introduces JOBCTL_PENDING_MASK and replaces task_clear_jobctl_stop_pending() with task_clear_jobctl_pending() which takes an extra @mask argument. JOBCTL_PENDING_MASK is currently equal to JOBCTL_STOP_PENDING but future patches will add more bits. recalc_sigpending_tsk() is updated to use JOBCTL_PENDING_MASK instead. task_clear_jobctl_pending() takes @mask which is a subset of JOBCTL_PENDING_MASK and clears the relevant jobctl bits. If JOBCTL_STOP_PENDING is set, other STOP bits are cleared together. All task_clear_jobctl_stop_pending() users are updated to call task_clear_jobctl_pending() with JOBCTL_STOP_PENDING which is functionally identical to task_clear_jobctl_stop_pending(). This patch doesn't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-04 18:17:10 +02:00
Tejun Heo	81be24b8cd	ptrace: relocate set_current_state(TASK_TRACED) in ptrace_stop() In ptrace_stop(), after arch hook is done, the task state and jobctl bits are updated while holding siglock. The ordering requirement there is that TASK_TRACED is set before JOBCTL_TRAPPING is cleared, so that a ptracer waiting on TRAPPING doesn't end up waking up before TRACED is actually set and seeing TASK_RUNNING in wait(2). Move set_current_state(TASK_TRACED) to the top of the block and reorganize comments. This makes the ordering more obvious (TASK_TRACED before other updates) and helps future updates to group stop participation. This patch doesn't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-04 18:17:10 +02:00
Tejun Heo	755e276b33	ptrace: ptrace_check_attach(): rename @kill to @ignore_state and add comments PTRACE_INTERRUPT is going to be added which should also skip task_is_traced() check in ptrace_check_attach(). Rename @kill to @ignore_state and make it bool. Add function comment while at it. This patch doesn't introduce any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-04 18:17:10 +02:00
Tejun Heo	a8f072c1d6	job control: rename signal->group_stop and flags to jobctl and update them signal->group_stop currently hosts mostly group stop related flags; however, it's gonna be used for wider purposes and the GROUP_STOP_ flag prefix becomes confusing. Rename signal->group_stop to signal->jobctl and rename all GROUP_STOP_* flags to JOBCTL_*. Bit position macros JOBCTL_*_BIT are defined and JOBCTL_* flags are defined in terms of them to allow using bitops later. While at it, reassign JOBCTL_TRAPPING to bit 22 to better accommodate future additions. This doesn't cause any functional change. -v2: JOBCTL_*_BIT macros added as suggested by Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-04 18:17:09 +02:00
Tejun Heo	0b1007c357	ptrace: remove silly wait_trap variable from ptrace_attach() Remove local variable wait_trap which determines whether to wait for !TRAPPING or not and simply wait for it if attach was successful. -v2: Oleg pointed out wait should happen iff attach was successful. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-06-04 18:17:09 +02:00
Ingo Molnar	3ce2a0bc9d	Merge branch 'perf/urgent' into perf/core Conflicts: tools/perf/util/python.c Merge reason: resolve the conflict with perf/urgent. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-04 12:28:05 +02:00
Vince Weaver	aa4a221875	perf: Comment /proc/sys/kernel/perf_event_paranoid to be part of user ABI Turns out that distro packages use this file as an indicator of the perf event subsystem - this is easier to check for from scripts than the existence of the system call. This is easy enough to keep around for the kernel, so add a comment to make sure it stays so. Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu> Cc: David Ahern <dsahern@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus@samba.org Cc: acme@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1106031751170.29381@cl320.eecs.utk.edu Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-04 12:22:04 +02:00
Ingo Molnar	710054ba25	Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent	2011-06-04 12:13:06 +02:00
Sebastian Andrzej Siewior	1c3cc11602	timers: Consider slack value in mod_timer() There is an optimization which does not update the timer if the timer was pending and the expiration time was unchanged. Since commit `3bbb9ec9` ("timers: Introduce the concept of timer slack for legacy timers") this optimization is no longer applied for timers where the expiration time got extended due to the slack value. So we need to check again after the expiration time might have been updated. [ tglx: Made it a single check by applying slack first and sorting out the slack = 0 value (all timeouts < 256 jiffies) early ] Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Link: http://lkml.kernel.org/r/20110521105828.GA29442@Chamillionaire.breakpoint.cc Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-06-03 15:02:32 +02:00
Mark Brown	c5182b8867	genirq: Ensure we locate the passed IRQ in irq_alloc_descs() When irq_alloc_descs() is called with no base IRQ specified then it will search for a range of IRQs starting from a specified base address. In the case where an IRQ is specified it still does this search in order to ensure that none of the requested range is already allocated and it still uses the from parameter to specify the base for the search. This means that in the case where a base is specified but from is zero (which is reasonable as any IRQ number is in the range specified by a zero from) the function will get confused and try to allocate the first suitably sized block of free IRQs it finds. Instead use a specified IRQ as the base address for the search, and insist that any from that is specified can support that IRQ. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Link: http://lkml.kernel.org/r/1307037313-15733-1-git-send-email-broonie@opensource.wolfsonmicro.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-06-03 14:53:16 +02:00
Linus Walleij	e7fbad300a	genirq: Fix descriptor init on non-sparse IRQs The genirq changes are initializing descriptors for sparse IRQs quite differently from how non-sparse (stacked?) IRQs are initialized, with the effect that on my platform all IRQs are default-disabled on sparse IRQs and default-enabled if non-sparse IRQs are used, crashing some GPIO driver. Fix this by refactoring the non-sparse IRQs to use the same descriptor init function as the sparse IRQs. Signed-off: Linus Walleij <linus.walleij@linaro.org> Link: http://lkml.kernel.org/r/1306858479-16622-1-git-send-email-linus.walleij@stericsson.com Cc: stable@kernel.org # 2.6.39 Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-06-03 14:53:16 +02:00
Sebastian Andrzej Siewior	3a43e05f4d	irq: Handle spurios irq detection for threaded irqs The detection of spurious interrupts is currently limited to the first level handler. In force-threaded mode we never notice if the threaded irq does not feel responsible. This patch catches the return value of the threaded handler and forwards it to the spurious detector. If the primary handler returns only IRQ_WAKE_THREAD then the spurious detector ignores it because it gets called again from the threaded handler. [ tglx: Report the erroneous return value early and bail out ] Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Link: http://lkml.kernel.org/r/1306824972-27067-2-git-send-email-sebastian@breakpoint.cc Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-06-03 14:53:15 +02:00
Sebastian Andrzej Siewior	ef26f20cd1	genirq: Print threaded handler in spurious debug output In forced threaded mode (or with an explicit threaded handler) we only see the primary handler, but not the threaded handler. Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Link: http://lkml.kernel.org/r/1306824972-27067-1-git-send-email-sebastian@breakpoint.cc Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-06-03 14:53:15 +02:00
Thomas Gleixner	1b054b67d3	clockevents: Handle empty cpumask gracefully For UP it's stupid to request an initialized cpumask for the clock event devices. Though we need the mask set even on UP to avoid a horrible ifdeffery especially in the broadcast code. For SMP we can at least try to survive with a warning and set the cpumask of the cpu we're running on. That gives a decent chance to bring the machine up and retrieve the debug info. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Lee Jones <lee.jones@linaro.org> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Cc: Stephen Boyd <sboyd@codeaurora.org>	2011-06-03 11:13:33 +02:00
Ingo Molnar	27eb4a1e4a	Merge commit 'v3.0-rc1' into perf/core Merge reason: merge in the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-06-03 10:41:08 +02:00
Ingo Molnar	e197f094b7	Merge branch 'unlikely/sched' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into sched/urgent	2011-06-03 10:27:47 +02:00
Peter Zijlstra	74c355fbdf	perf, cgroups: Fix up for new API Ben changed the cgroup API in commit `f780bdb7c1` (cgroups: add per-thread subsystem callbacks) in an incompatible way, but forgot to convert the perf cgroup bits. Avoid compile warnings and runtime splats and convert perf too ;-) Acked-by: Ben Blum <bblum@andrew.cmu.edu> Cc: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1306767651.1200.2990.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-31 14:20:25 +02:00
Peter Zijlstra	f339b9dc1f	sched: Fix schedstat.nr_wakeups_migrate While looking over the code I found that with the ttwu rework the nr_wakeups_migrate test broke since we now switch cpus prior to calling ttwu_stat(), hence the test is always true. Cure this by passing the migration state in wake_flags. Also move the whole test under CONFIG_SMP, it's hard to migrate tasks on UP :-) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-pwwxl7gdqs5676f1d4cx6pj7@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-31 14:19:57 +02:00
Peter Zijlstra	f01114cb59	sched: Fix cross-cpu clock sync on remote wakeups Markus reported that commit `317f394160` ("sched: Move the second half of ttwu() to the remote cpu") caused some accounting funnies on his AMD Phenom II X4, such as weird 'top' results. It turns out that this is due to non-synced TSC and the queued remote wakeups stopped coupling the two relevant cpu clocks, which leads to wakeups seeing time jumps, which in turn lead to skewed runtime stats. Add an explicit call to sched_clock_cpu() to couple the per-cpu clocks to restore the normal flow of time. Reported-and-tested-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1306835745.2353.3.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-31 14:19:56 +02:00
Peter Zijlstra	d72bce0e67	rcu: Cure load woes Commit `cc3ce5176d` (rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state) fudges a sleeping task's state, resulting in the scheduler seeing a TASK_UNINTERRUPTIBLE task going to sleep, but a TASK_INTERRUPTIBLE task waking up. The result is unbalanced load calculation. The problem that patch tried to address is that the RCU threads could stay in UNINTERRUPTIBLE state for quite a while and trigger the hung task detector due to on-demand wake-ups. Cure the problem differently by always giving the tasks at least one wake-up once the CPU is fully up and running, this will kick them out of the initial UNINTERRUPTIBLE state and into the regular INTERRUPTIBLE wait state. [ The alternative would be teaching kthread_create() to start threads as INTERRUPTIBLE but that needs a tad more thought. ] Reported-by: Damien Wyart <damien.wyart@free.fr> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul E. McKenney <paul.mckenney@linaro.org> Link: http://lkml.kernel.org/r/1306755291.1200.2872.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-31 10:01:48 +02:00
Linus Torvalds	6345d24daf	mm: Fix boot crash in mm_alloc() Thomas Gleixner reports that we now have a boot crash triggered by CONFIG_CPUMASK_OFFSTACK=y: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<c11ae035>] find_next_bit+0x55/0xb0 Call Trace: [<c11addda>] cpumask_any_but+0x2a/0x70 [<c102396b>] flush_tlb_mm+0x2b/0x80 [<c1022705>] pud_populate+0x35/0x50 [<c10227ba>] pgd_alloc+0x9a/0xf0 [<c103a3fc>] mm_init+0xec/0x120 [<c103a7a3>] mm_alloc+0x53/0xd0 which was introduced by commit `de03c72cfc` ("mm: convert mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of mm_init() vs mm_init_cpumask. Thomas wrote a patch to just fix the ordering of initialization, but I hate the new double allocation in the fork path, so I ended up instead doing some more radical surgery to clean it all up. Reported-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Ingo Molnar <mingo@elte.hu> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-29 11:32:28 -07:00
Linus Torvalds	f310642123	Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6 * 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6: x86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param x86 idle: deprecate "no-hlt" cmdline param x86 idle APM: deprecate CONFIG_APM_CPU_IDLE x86 idle floppy: deprecate disable_hlt() x86 idle: EXPORT_SYMBOL(default_idle, pm_idle) only when APM demands it x86 idle: clarify AMD erratum 400 workaround idle governor: Avoid lock acquisition to read pm_qos before entering idle cpuidle: menu: fixed wrapping timers at 4.294 seconds	2011-05-29 11:18:09 -07:00
Tim Chen	333c5ae994	idle governor: Avoid lock acquisition to read pm_qos before entering idle Thanks to the reviews and comments by Rafael, James, Mark and Andi. Here's version 2 of the patch incorporating your comments and also some update to my previous patch comments. I noticed that before entering idle state, the menu idle governor will look up the current pm_qos target value according to the list of qos requests received. This look up currently needs the acquisition of a lock to access the list of qos requests to find the qos target value, slowing down the entrance into idle state due to contention by multiple cpus to access this list. The contention is severe when there are a lot of cpus waking and going into idle. For example, for a simple workload that has 32 pairs of processes ping ponging messages to each other, where 64 cpu cores are active in the test system, I see the following profile with 37.82% of cpu cycles spent in contention of pm_qos_lock: - 37.82% swapper [kernel.kallsyms] [k] _raw_spin_lock_irqsave - _raw_spin_lock_irqsave - 95.65% pm_qos_request menu_select cpuidle_idle_call - cpu_idle 99.98% start_secondary A better approach will be to cache the updated pm_qos target value so reading it does not require lock acquisition as in the patch below. With this patch the contention for pm_qos_lock is removed and I saw a 2.2X increase in throughput for my message passing workload. cc: stable@kernel.org Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: James Bottomley <James.Bottomley@suse.de> Acked-by: mark gross <markgross@thegnar.org> Signed-off-by: Len Brown <len.brown@intel.com>	2011-05-29 00:50:59 -04:00
Linus Torvalds	08a8b79600	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed sched: Fix ->min_vruntime calculation in dequeue_entity() sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW sched: More sched_domain iterations fixes	2011-05-28 12:56:46 -07:00
Linus Torvalds	1ba4b8cb94	Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state rcu: Remove waitqueue usage for cpu, node, and boost kthreads rcu: Avoid acquiring rcu_node locks in timer functions atomic: Add atomic_or() Documentation: Add statistics about nested locks rcu: Decrease memory-barrier usage based on semi-formal proof rcu: Make rcu_enter_nohz() pay attention to nesting rcu: Don't do reschedule unless in irq rcu: Remove old memory barriers from rcu_process_callbacks() rcu: Add memory barriers rcu: Fix unpaired rcu_irq_enter() from locking selftests	2011-05-28 12:56:32 -07:00
Linus Torvalds	c4a227d89f	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits) perf: Fix SIGIO handling perf top: Don't stop if no kernel symtab is found perf top: Handle kptr_restrict perf top: Remove unused macro perf events: initialize fd array to -1 instead of 0 perf tools: Make sure kptr_restrict warnings fit 80 col terms perf tools: Fix build on older systems perf symbols: Handle /proc/sys/kernel/kptr_restrict perf: Remove duplicate headers ftrace: Add internal recursive checks tracing: Update btrfs's tracepoints to use u64 interface tracing: Add __print_symbolic_u64 to avoid warnings on 32bit machine ftrace: Set ops->flag to enabled even on static function tracing tracing: Have event with function tracer check error return ftrace: Have ftrace_startup() return failure code jump_label: Check entries limit in __jump_label_update ftrace/recordmcount: Avoid STT_FUNC symbols as base on ARM scripts/tags.sh: Add magic for trace-events for etags too scripts/tags.sh: Fix ctags for DEFINE_EVENT() x86/ftrace: Fix compiler warning in ftrace.c ...	2011-05-28 12:55:55 -07:00
Peter Zijlstra	64ce312618	perf: De-schedule a task context when removing the last event Since perf_install_in_context() will now install a context when we add the first event, we can de-schedule the context when the last event is removed. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110409192142.090431763@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 18:01:23 +02:00
Peter Zijlstra	e03a9a55b4	perf: Change close() semantics for group events In order to always call list_del_event() on the correct cpu if the event is part of an active context and avoid having to do two IPIs, change the close() semantics slightly. The current perf_event_disable() call would disable a whole group if the event that's being closed is the group leader, whereas the new code keeps the group siblings enabled. People should not rely on this behaviour and I don't think they do, but in case we find they do, the fix is easy and we have to take the double IPI cost. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vince Weaver <vweaver1@eecs.utk.edu> Link: http://lkml.kernel.org/r/20110409192142.038377551@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 18:01:21 +02:00
Peter Zijlstra	dce5855bba	perf: Collect the schedule-in rules in one function This was scattered out - refactor it into a single function. No change in functionality. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110409192141.979862055@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 18:01:19 +02:00
Peter Zijlstra	db24d33e08	perf: Change and simplify ctx::is_active semantics Instead of tracking if a context is active or not, track which events of the context are active. By making it a bitmask of EVENT_PINNED|EVENT_FLEXIBLE we can simplify some of the scheduling routines since it can avoid adding events that are already active. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110409192141.930282378@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 18:01:17 +02:00
Peter Zijlstra	2c29ef0fef	perf: Simplify and fix __perf_install_in_context() Currently __perf_install_in_context() will try and schedule in the event irrespective of our event scheduling rules, that is, we try to schedule CPU-pinned, TASK-pinned, CPU-flexible, TASK-flexible, but when creating a new event we simply try and schedule it on top of whatever is already on the PMU, this can lead to errors for pinned events. Therefore, simplify things and simply schedule everything out, add the event to the corresponding context and schedule everything back in. This also nicely handles the case where with __ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI can come right in the middle of schedule, before we managed to call perf_event_task_sched_in(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110409192141.870894224@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 18:01:16 +02:00
Peter Zijlstra	04dc2dbbfe	perf: Remove task_ctx_sched_in() Make task_ctx_sched_*() imply EVENT_ALL, since anything less will not actually have scheduled the task in/out at all. Since there's no site that schedules all of a task in (due to the interleave with flexible cpuctx) we can remove this function. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110409192141.817893268@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 18:01:14 +02:00
Peter Zijlstra	facc43071c	perf: Optimize event scheduling locking Currently we only hold one ctx->lock at a time, which results in us flipping back and forth between cpuctx->ctx.lock and task_ctx->lock. Avoid this and gain large atomic regions by holding both locks. We nest the task lock inside the cpu lock, since with task scheduling we might have to change task ctx while holding the cpu ctx lock. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110409192141.769881865@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 18:01:12 +02:00
Peter Zijlstra	9137fb28ac	perf: Clean up 'ctx' reference counting Small cleanup to how we refcount in find_get_context(), this also allows us to use put_ctx() to free things instead of using kfree(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110409192141.719340481@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 18:01:10 +02:00
Peter Zijlstra	075e0b0085	perf: Optimize ctx_sched_out() Oleg noted that ctx_sched_out() disables the PMU even though it might not actually do something, avoid needless PMU-disabling. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110409192141.665385503@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 18:01:09 +02:00
Paul E. McKenney	cc3ce5176d	rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state Upon creation, kthreads are in TASK_UNINTERRUPTIBLE state, which can result in softlockup warnings. Because some of RCU's kthreads can legitimately be idle indefinitely, start them in TASK_INTERRUPTIBLE state in order to avoid those warnings. Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 17:41:56 +02:00
Peter Zijlstra	08bca60a69	rcu: Remove waitqueue usage for cpu, node, and boost kthreads It is not necessary to use waitqueues for the RCU kthreads because we always know exactly which thread is to be awakened. In addition, wake_up() only issues an actual wakeup when there is a thread waiting on the queue, which was why there was an extra explicit wake_up_process() to get the RCU kthreads started. Eliminating the waitqueues (and wake_up()) in favor of wake_up_process() eliminates the need for the initial wake_up_process() and also shrinks the data structure size a bit. The wakeup logic is placed in a new rcu_wait() macro. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 17:41:52 +02:00
Paul E. McKenney	8826f3b039	rcu: Avoid acquiring rcu_node locks in timer functions This commit switches manipulations of the rcu_node ->wakemask field to atomic operations, which allows rcu_cpu_kthread_timer() to avoid acquiring the rcu_node lock. This should avoid the following lockdep splat reported by Valdis Kletnieks: [ 12.872150] usb 1-4: new high speed USB device number 3 using ehci_hcd [ 12.986667] usb 1-4: New USB device found, idVendor=413c, idProduct=2513 [ 12.986679] usb 1-4: New USB device strings: Mfr=0, Product=0, SerialNumber=0 [ 12.987691] hub 1-4:1.0: USB hub found [ 12.987877] hub 1-4:1.0: 3 ports detected [ 12.996372] input: PS/2 Generic Mouse as /devices/platform/i8042/serio1/input/input10 [ 13.071471] udevadm used greatest stack depth: 3984 bytes left [ 13.172129] [ 13.172130] ======================================================= [ 13.172425] [ INFO: possible circular locking dependency detected ] [ 13.172650] 2.6.39-rc6-mmotm0506 #1 [ 13.172773] ------------------------------------------------------- [ 13.172997] blkid/267 is trying to acquire lock: [ 13.173009] (&p->pi_lock){-.-.-.}, at: [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa [ 13.173009] [ 13.173009] but task is already holding lock: [ 13.173009] (rcu_node_level_0){..-...}, at: [<ffffffff810901cc>] rcu_cpu_kthread_timer+0x27/0x58 [ 13.173009] [ 13.173009] which lock already depends on the new lock. 
[ 13.173009] [ 13.173009] [ 13.173009] the existing dependency chain (in reverse order) is: [ 13.173009] [ 13.173009] -> #2 (rcu_node_level_0){..-...}: [ 13.173009] [<ffffffff810679b9>] check_prevs_add+0x8b/0x104 [ 13.173009] [<ffffffff81067da1>] validate_chain+0x36f/0x3ab [ 13.173009] [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2 [ 13.173009] [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c [ 13.173009] [<ffffffff815697f1>] _raw_spin_lock+0x36/0x45 [ 13.173009] [<ffffffff81090794>] rcu_read_unlock_special+0x8c/0x1d5 [ 13.173009] [<ffffffff8109092c>] __rcu_read_unlock+0x4f/0xd7 [ 13.173009] [<ffffffff81027bd3>] rcu_read_unlock+0x21/0x23 [ 13.173009] [<ffffffff8102cc34>] cpuacct_charge+0x6c/0x75 [ 13.173009] [<ffffffff81030cc6>] update_curr+0x101/0x12e [ 13.173009] [<ffffffff810311d0>] check_preempt_wakeup+0xf7/0x23b [ 13.173009] [<ffffffff8102acb3>] check_preempt_curr+0x2b/0x68 [ 13.173009] [<ffffffff81031d40>] ttwu_do_wakeup+0x76/0x128 [ 13.173009] [<ffffffff81031e49>] ttwu_do_activate.constprop.63+0x57/0x5c [ 13.173009] [<ffffffff81031e96>] scheduler_ipi+0x48/0x5d [ 13.173009] [<ffffffff810177d5>] smp_reschedule_interrupt+0x16/0x18 [ 13.173009] [<ffffffff815710f3>] reschedule_interrupt+0x13/0x20 [ 13.173009] [<ffffffff810b66d1>] rcu_read_unlock+0x21/0x23 [ 13.173009] [<ffffffff810b739c>] find_get_page+0xa9/0xb9 [ 13.173009] [<ffffffff810b8b48>] filemap_fault+0x6a/0x34d [ 13.173009] [<ffffffff810d1a25>] __do_fault+0x54/0x3e6 [ 13.173009] [<ffffffff810d447a>] handle_pte_fault+0x12c/0x1ed [ 13.173009] [<ffffffff810d48f7>] handle_mm_fault+0x1cd/0x1e0 [ 13.173009] [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de [ 13.173009] [<ffffffff8156a75f>] page_fault+0x1f/0x30 [ 13.173009] [ 13.173009] -> #1 (&rq->lock){-.-.-.}: [ 13.173009] [<ffffffff810679b9>] check_prevs_add+0x8b/0x104 [ 13.173009] [<ffffffff81067da1>] validate_chain+0x36f/0x3ab [ 13.173009] [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2 [ 13.173009] [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c [ 13.173009] 
[<ffffffff815697f1>] _raw_spin_lock+0x36/0x45 [ 13.173009] [<ffffffff81027e19>] __task_rq_lock+0x8b/0xd3 [ 13.173009] [<ffffffff81032f7f>] wake_up_new_task+0x41/0x108 [ 13.173009] [<ffffffff810376c3>] do_fork+0x265/0x33f [ 13.173009] [<ffffffff81007d02>] kernel_thread+0x6b/0x6d [ 13.173009] [<ffffffff8153a9dd>] rest_init+0x21/0xd2 [ 13.173009] [<ffffffff81b1db4f>] start_kernel+0x3bb/0x3c6 [ 13.173009] [<ffffffff81b1d29f>] x86_64_start_reservations+0xaf/0xb3 [ 13.173009] [<ffffffff81b1d393>] x86_64_start_kernel+0xf0/0xf7 [ 13.173009] [ 13.173009] -> #0 (&p->pi_lock){-.-.-.}: [ 13.173009] [<ffffffff81067788>] check_prev_add+0x68/0x20e [ 13.173009] [<ffffffff810679b9>] check_prevs_add+0x8b/0x104 [ 13.173009] [<ffffffff81067da1>] validate_chain+0x36f/0x3ab [ 13.173009] [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2 [ 13.173009] [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c [ 13.173009] [<ffffffff815698ea>] _raw_spin_lock_irqsave+0x44/0x57 [ 13.173009] [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa [ 13.173009] [<ffffffff81032f3c>] wake_up_process+0x10/0x12 [ 13.173009] [<ffffffff810901e9>] rcu_cpu_kthread_timer+0x44/0x58 [ 13.173009] [<ffffffff81045286>] call_timer_fn+0xac/0x1e9 [ 13.173009] [<ffffffff8104556d>] run_timer_softirq+0x1aa/0x1f2 [ 13.173009] [<ffffffff8103e487>] __do_softirq+0x109/0x26a [ 13.173009] [<ffffffff8157144c>] call_softirq+0x1c/0x30 [ 13.173009] [<ffffffff81003207>] do_softirq+0x44/0xf1 [ 13.173009] [<ffffffff8103e8b9>] irq_exit+0x58/0xc8 [ 13.173009] [<ffffffff81017f5a>] smp_apic_timer_interrupt+0x79/0x87 [ 13.173009] [<ffffffff81570fd3>] apic_timer_interrupt+0x13/0x20 [ 13.173009] [<ffffffff810bd51a>] get_page_from_freelist+0x2aa/0x310 [ 13.173009] [<ffffffff810bdf03>] __alloc_pages_nodemask+0x178/0x243 [ 13.173009] [<ffffffff8101fe2f>] pte_alloc_one+0x1e/0x3a [ 13.173009] [<ffffffff810d27fe>] __pte_alloc+0x22/0x14b [ 13.173009] [<ffffffff810d48a8>] handle_mm_fault+0x17e/0x1e0 [ 13.173009] [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de [ 
13.173009] [<ffffffff8156a75f>] page_fault+0x1f/0x30 [ 13.173009] [ 13.173009] other info that might help us debug this: [ 13.173009] [ 13.173009] Chain exists of: [ 13.173009] &p->pi_lock --> &rq->lock --> rcu_node_level_0 [ 13.173009] [ 13.173009] Possible unsafe locking scenario: [ 13.173009] [ 13.173009] CPU0 CPU1 [ 13.173009] ---- ---- [ 13.173009] lock(rcu_node_level_0); [ 13.173009] lock(&rq->lock); [ 13.173009] lock(rcu_node_level_0); [ 13.173009] lock(&p->pi_lock); [ 13.173009] [ 13.173009] * DEADLOCK * [ 13.173009] [ 13.173009] 3 locks held by blkid/267: [ 13.173009] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff8156cdb4>] do_page_fault+0x1f3/0x5de [ 13.173009] #1: (&yield_timer){+.-...}, at: [<ffffffff810451da>] call_timer_fn+0x0/0x1e9 [ 13.173009] #2: (rcu_node_level_0){..-...}, at: [<ffffffff810901cc>] rcu_cpu_kthread_timer+0x27/0x58 [ 13.173009] [ 13.173009] stack backtrace: [ 13.173009] Pid: 267, comm: blkid Not tainted 2.6.39-rc6-mmotm0506 #1 [ 13.173009] Call Trace: [ 13.173009] <IRQ> [<ffffffff8154a529>] print_circular_bug+0xc8/0xd9 [ 13.173009] [<ffffffff81067788>] check_prev_add+0x68/0x20e [ 13.173009] [<ffffffff8100c861>] ? save_stack_trace+0x28/0x46 [ 13.173009] [<ffffffff810679b9>] check_prevs_add+0x8b/0x104 [ 13.173009] [<ffffffff81067da1>] validate_chain+0x36f/0x3ab [ 13.173009] [<ffffffff8106846b>] __lock_acquire+0x369/0x3e2 [ 13.173009] [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa [ 13.173009] [<ffffffff81068a0f>] lock_acquire+0xfc/0x14c [ 13.173009] [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa [ 13.173009] [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82 [ 13.173009] [<ffffffff815698ea>] _raw_spin_lock_irqsave+0x44/0x57 [ 13.173009] [<ffffffff81032d8f>] ? try_to_wake_up+0x29/0x1aa [ 13.173009] [<ffffffff81032d8f>] try_to_wake_up+0x29/0x1aa [ 13.173009] [<ffffffff810901a5>] ? 
rcu_check_quiescent_state+0x82/0x82 [ 13.173009] [<ffffffff81032f3c>] wake_up_process+0x10/0x12 [ 13.173009] [<ffffffff810901e9>] rcu_cpu_kthread_timer+0x44/0x58 [ 13.173009] [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82 [ 13.173009] [<ffffffff81045286>] call_timer_fn+0xac/0x1e9 [ 13.173009] [<ffffffff810451da>] ? del_timer+0x75/0x75 [ 13.173009] [<ffffffff810901a5>] ? rcu_check_quiescent_state+0x82/0x82 [ 13.173009] [<ffffffff8104556d>] run_timer_softirq+0x1aa/0x1f2 [ 13.173009] [<ffffffff8103e487>] __do_softirq+0x109/0x26a [ 13.173009] [<ffffffff8106365f>] ? tick_dev_program_event+0x37/0xf6 [ 13.173009] [<ffffffff810a0e4a>] ? time_hardirqs_off+0x1b/0x2f [ 13.173009] [<ffffffff8157144c>] call_softirq+0x1c/0x30 [ 13.173009] [<ffffffff81003207>] do_softirq+0x44/0xf1 [ 13.173009] [<ffffffff8103e8b9>] irq_exit+0x58/0xc8 [ 13.173009] [<ffffffff81017f5a>] smp_apic_timer_interrupt+0x79/0x87 [ 13.173009] [<ffffffff81570fd3>] apic_timer_interrupt+0x13/0x20 [ 13.173009] <EOI> [<ffffffff810bd384>] ? get_page_from_freelist+0x114/0x310 [ 13.173009] [<ffffffff810bd51a>] ? get_page_from_freelist+0x2aa/0x310 [ 13.173009] [<ffffffff812220e7>] ? clear_page_c+0x7/0x10 [ 13.173009] [<ffffffff810bd1ef>] ? prep_new_page+0x14c/0x1cd [ 13.173009] [<ffffffff810bd51a>] get_page_from_freelist+0x2aa/0x310 [ 13.173009] [<ffffffff810bdf03>] __alloc_pages_nodemask+0x178/0x243 [ 13.173009] [<ffffffff810d46b9>] ? __pmd_alloc+0x87/0x99 [ 13.173009] [<ffffffff8101fe2f>] pte_alloc_one+0x1e/0x3a [ 13.173009] [<ffffffff810d46b9>] ? __pmd_alloc+0x87/0x99 [ 13.173009] [<ffffffff810d27fe>] __pte_alloc+0x22/0x14b [ 13.173009] [<ffffffff810d48a8>] handle_mm_fault+0x17e/0x1e0 [ 13.173009] [<ffffffff8156cfee>] do_page_fault+0x42d/0x5de [ 13.173009] [<ffffffff810d915f>] ? sys_brk+0x32/0x10c [ 13.173009] [<ffffffff810a0e4a>] ? time_hardirqs_off+0x1b/0x2f [ 13.173009] [<ffffffff81065c4f>] ? trace_hardirqs_off_caller+0x3f/0x9c [ 13.173009] [<ffffffff812235dd>] ? 
trace_hardirqs_off_thunk+0x3a/0x3c [ 13.173009] [<ffffffff8156a75f>] page_fault+0x1f/0x30 [ 14.010075] usb 5-1: new full speed USB device number 2 using uhci_hcd Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 17:41:49 +02:00
Ingo Molnar	29f742f88a	Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/urgent	2011-05-28 17:41:05 +02:00
Peter Zijlstra	f506b3dc0e	perf: Fix SIGIO handling Vince noticed that unless we mmap() a buffer, SIGIO gets lost. So explicitly push the wakeup (including signals) when requested. Reported-by: Vince Weaver <vweaver1@eecs.utk.edu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/n/tip-2euus3f3x3dyvdk52cjxw8zu@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 17:04:59 +02:00
KOSAKI Motohiro	1e1b6c511d	cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed The rule is, we have to update tsk->rt.nr_cpus_allowed if we change tsk->cpus_allowed. Otherwise the RT scheduler may get confused. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 17:02:57 +02:00
Peter Zijlstra	1e87623178	sched: Fix ->min_vruntime calculation in dequeue_entity() Dima Zavin <dima@android.com> reported: "After pulling the thread off the run-queue during a cgroup change, the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime then gets normalized to this new value. This can then lead to the thread getting an unfair boost in the new group if the vruntime of the next task in the old run-queue was way further ahead." Reported-by: Dima Zavin <dima@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Recalls-having-tested-once-upon-a-time-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 17:02:56 +02:00
Peter Zijlstra	d6aa8f85f1	sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW Marc reported that `e4a52bcb9` (sched: Remove rq->lock from the first half of ttwu()) broke his ARM-SMP machine. Now ARM is one of the few __ARCH_WANT_INTERRUPTS_ON_CTXSW users, so that exception in the ttwu() code was suspect. Yong found that the interrupt could hit after context_switch() changes current but before it clears p->on_cpu, if that interrupt were to attempt a wake-up of p we would indeed find ourselves spinning in IRQ context. Fix this by reverting to the old behaviour for this situation and perform a full remote wake-up. Cc: Frank Rowand <frank.rowand@am.sony.com> Cc: Yong Zhang <yong.zhang0@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Reported-by: Marc Zyngier <Marc.Zyngier@arm.com> Tested-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 17:02:55 +02:00
Xiaotian Feng	cd4ae6adf8	sched: More sched_domain iterations fixes sched_domain iterations need to be protected by rcu_read_lock() now. This patch adds two more places which need the RCU lock, spotted by the following suspicious rcu_dereference_check() usage warnings: kernel/sched_rt.c:1244 invoked rcu_dereference_check() without protection! kernel/sched_stats.h:41 invoked rcu_dereference_check() without protection! Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1303469634-11678-1-git-send-email-dfeng@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-28 17:02:54 +02:00
Linus Torvalds	f23a5e1405	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: Fix PM QOS's user mode interface to work with ASCII input PM / Hibernate: Update kerneldoc comments in hibernate.c PM / Hibernate: Remove arch_prepare_suspend() PM / Hibernate: Update some comments in core hibernate code	2011-05-27 14:27:34 -07:00
Linus Torvalds	e52e713ec3	Merge branch 'docs-move' of git://git.kernel.org/pub/scm/linux/kernel/git/rdunlap/linux-docs * 'docs-move' of git://git.kernel.org/pub/scm/linux/kernel/git/rdunlap/linux-docs: Create Documentation/security/, move LSM-, credentials-, and keys-related files from Documentation/ to Documentation/security/, add Documentation/security/00-INDEX, and update all occurrences of Documentation/<moved_file> to Documentation/security/<moved_file>.	2011-05-27 10:25:02 -07:00
Ingo Molnar	d6a72fe465	Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/urgent	2011-05-27 14:28:09 +02:00
Rakib Mullick	6f7bd76f05	kernel/profile.c: remove some duplicate code from profile_hits() profile_hits() has a common check for prof_on and prof_buffer regardless of SMP or !SMP. So, remove some duplicate code by splitting profile_hits into two. [akpm@linux-foundation.org: make do_profile_hits static] Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-26 17:12:37 -07:00
Jiri Slaby	3864601387	mm: extract exe_file handling from procfs Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/. This was because exe_file was needed only for /proc/<pid>/exe. Since we will need the exe_file functionality also for core dumps (so core name can contain full binary path), build this functionality always into the kernel. To achieve that, move it out of proc FS to kernel/, where in fact it should belong. By doing that we can make dup_mm_exe_file static. Also we can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-26 17:12:36 -07:00
Daniel Lezcano	a77aea9201	cgroup: remove the ns_cgroup The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and leads to some problems: * cgroup creation is out-of-control * cgroup name can conflict when pids are looping * it is not possible to have a single process handling a lot of namespaces without falling into an exponential creation time * we may want to create a namespace without creating a cgroup The ns_cgroup was replaced by a compatibility flag 'clone_children', where a newly created cgroup will copy the parent cgroup values. The userspace has to manually create a cgroup and add a task to the 'tasks' file. This patch removes the ns_cgroup as suggested in the following thread: https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html The 'cgroup_clone' function is removed because it is no longer used. This is a userspace-visible change. Commit `45531757b4` ("cgroup: notify ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a printk warning users that the feature is planned for removal. Since that time we have heard from XXX users who were affected by this. Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jamal Hadi Salim <hadi@cyberus.ca> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-26 17:12:34 -07:00
Ben Blum	d846687d7f	cgroups: use flex_array in attach_proc Convert cgroup_attach_proc to use flex_array. The cgroup_attach_proc implementation requires a pre-allocated array to store task pointers to atomically move a thread-group, but asking for a monolithic array with kmalloc() may be unreliable for very large groups. Using flex_array provides the same functionality with less risk of failure. This is a post-patch for cgroup-procs-write.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-26 17:12:34 -07:00
Ben Blum	74a1166dfe	cgroups: make procs file writable Make procs file writable to move all threads by tgid at once. Add functionality that enables users to move all threads in a threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This current implementation makes use of a per-threadgroup rwsem that's taken for reading in the fork() path to prevent newly forking threads within the threadgroup from "escaping" while the move is in progress. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-26 17:12:34 -07:00
Ben Blum	f780bdb7c1	cgroups: add per-thread subsystem callbacks Add cgroup subsystem callbacks for per-thread attachment in atomic contexts Add can_attach_task(), pre_attach(), and attach_task() as new callbacks for cgroups's subsystem interface. Unlike can_attach and attach, these are for per-thread operations, to be called potentially many times when attaching an entire threadgroup. Also, the old "bool threadgroup" interface is removed, as replaced by this. All subsystems are modified for the new interface - of note is cpuset, which requires from/to nodemasks for attach to be globally scoped (though per-cpuset would work too) to persist from its pre_attach to attach_task and attach. This is a pre-patch for cgroup-procs-writable.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-26 17:12:34 -07:00
Ben Blum	4714d1d32d	cgroups: read-write lock CLONE_THREAD forking per threadgroup Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup Add an rwsem that lives in a threadgroup's signal_struct that's taken for reading in the fork path, under CONFIG_CGROUPS. If another part of the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS ifdefs should be changed to a higher-up flag that CGROUPS and the other system would both depend on. This is a pre-patch for cgroup-procs-write.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-26 17:12:34 -07:00
Rafael J. Wysocki	0775a60aca	PM: Fix PM QOS's user mode interface to work with ASCII input Make pm_qos_power_write() accept values passed to it in the ASCII hex format either with or without an ending newline. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Mark Gross <markgross@thegnar.org>	2011-05-27 00:05:23 +02:00
Paul E. McKenney	23b5c8fa01	rcu: Decrease memory-barrier usage based on semi-formal proof (Note: this was reverted, and is now being re-applied in pieces, with this being the fifth and final piece. See below for the reason that it is now felt to be safe to re-apply this.) Commit `d09b62d` fixed grace-period synchronization, but left some smp_mb() invocations in rcu_process_callbacks() that are no longer needed, but sheer paranoia prevented them from being removed. This commit removes them and provides a proof of correctness in their absence. It also adds a memory barrier to rcu_report_qs_rsp() immediately before the update to rsp->completed in order to handle the theoretical possibility that the compiler or CPU might move massive quantities of code into a lock-based critical section. This also proves that the sheer paranoia was not entirely unjustified, at least from a theoretical point of view. In addition, the old dyntick-idle synchronization depended on the fact that grace periods were many milliseconds in duration, so that it could be assumed that no dyntick-idle CPU could reorder a memory reference across an entire grace period. Unfortunately for this design, the addition of expedited grace periods breaks this assumption, which has the unfortunate side-effect of requiring atomic operations in the functions that track dyntick-idle state for RCU. (There is some hope that the algorithms used in user-level RCU might be applied here, but some work is required to handle the NMIs that user-space applications can happily ignore. For the short term, better safe than sorry.) This proof assumes that neither compiler nor CPU will allow a lock acquisition and release to be reordered, as doing so can result in deadlock. The proof is as follows: 1. A given CPU declares a quiescent state under the protection of its leaf rcu_node's lock. 2. 
If there is more than one level of rcu_node hierarchy, the last CPU to declare a quiescent state will also acquire the ->lock of the next rcu_node up in the hierarchy, but only after releasing the lower level's lock. The acquisition of this lock clearly cannot occur prior to the acquisition of the leaf node's lock. 3. Step 2 repeats until we reach the root rcu_node structure. Please note again that only one lock is held at a time through this process. The acquisition of the root rcu_node's ->lock must occur after the release of that of the leaf rcu_node. 4. At this point, we set the ->completed field in the rcu_state structure in rcu_report_qs_rsp(). However, if the rcu_node hierarchy contains only one rcu_node, then in theory the code preceding the quiescent state could leak into the critical section. We therefore precede the update of ->completed with a memory barrier. All CPUs will therefore agree that any updates preceding any report of a quiescent state will have happened before the update of ->completed. 5. Regardless of whether a new grace period is needed, rcu_start_gp() will propagate the new value of ->completed to all of the leaf rcu_node structures, under the protection of each rcu_node's ->lock. If a new grace period is needed immediately, this propagation will occur in the same critical section that ->completed was set in, but courtesy of the memory barrier in #4 above, is still seen to follow any pre-quiescent-state activity. 6. When a given CPU invokes __rcu_process_gp_end(), it becomes aware of the end of the old grace period and therefore makes any RCU callbacks that were waiting on that grace period eligible for invocation. If this CPU is the same one that detected the end of the grace period, and if there is but a single rcu_node in the hierarchy, we will still be in the single critical section. In this case, the memory barrier in step #4 guarantees that all callbacks will be seen to execute after each CPU's quiescent state. 
On the other hand, if this is a different CPU, it will acquire the leaf rcu_node's ->lock, and will again be serialized after each CPU's quiescent state for the old grace period. On the strength of this proof, this commit therefore removes the memory barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp(). The effect is to reduce the number of memory barriers by one and to reduce the frequency of execution from about once per scheduling tick per CPU to once per grace period. This was reverted due to hangs found during testing by Yinghai Lu and Ingo Molnar. Frederic Weisbecker supplied Yinghai with tracing that located the underlying problem, and Frederic also provided the fix. The underlying problem was that the HARDIRQ_ENTER() macro from lib/locking-selftest.c invoked irq_enter(), which in turn invokes rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which does not invoke rcu_irq_exit(). This situation resulted in calls to rcu_irq_enter() that were not balanced by the required calls to rcu_irq_exit(). Therefore, after these locking selftests completed, RCU's dyntick-idle nesting count was a large number (for example, 72), which caused RCU to conclude that the affected CPU was not in dyntick-idle mode when in fact it was. RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting in hangs. In contrast, with Frederic's patch, which replaces the irq_enter() in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU running the test is already marked as not being in dyntick-idle mode. This means that the rcu_irq_enter() and rcu_irq_exit() calls remain balanced, and RCU then has no problem working out which CPUs are in dyntick-idle mode and which are not. The reason that the imbalance was not noticed before the barrier patch was applied is that the old implementation of rcu_enter_nohz() ignored the nesting depth. 
This could still result in delays, but much shorter ones. Whenever there was a delay, RCU would IPI the CPU with the unbalanced nesting level, which would eventually result in rcu_enter_nohz() being called, which in turn would force RCU to see that the CPU was in dyntick-idle mode. The reason that very few people noticed the problem is that the mismatched irq_enter() vs. __irq_exit() occurred only when the kernel was built with CONFIG_DEBUG_LOCKING_API_SELFTESTS. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-26 09:42:23 -07:00
Paul E. McKenney	4305ce7894	rcu: Make rcu_enter_nohz() pay attention to nesting The old version of rcu_enter_nohz() forced RCU into nohz mode even if the nesting count was non-zero. This change causes rcu_enter_nohz() to hold off for non-zero nesting counts. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-05-26 09:42:22 -07:00
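The nesting behaviour described in the two entries above can be illustrated with a toy user-space model. This is only a sketch, not the kernel's code: `struct toy_dyntick` and the `toy_*` functions are hypothetical names, and the model assumes a single counter that interrupt entry increments and interrupt exit decrements.

```c
/* Toy model of a dyntick-idle nesting counter (hypothetical, not kernel code). */
struct toy_dyntick {
    int nesting; /* outstanding irq entries not yet matched by an exit */
    int idle;    /* 1 once the CPU is considered dyntick-idle */
};

/* Interrupt entry: the CPU is busy, so bump the nesting count. */
void toy_irq_enter(struct toy_dyntick *s)
{
    s->nesting++;
    s->idle = 0;
}

/* Interrupt exit: must be paired with toy_irq_enter(). */
void toy_irq_exit(struct toy_dyntick *s)
{
    s->nesting--;
}

/*
 * Enter dyntick-idle mode only when nothing is nested.
 * Returns 1 if the CPU is now marked idle, 0 if the request was held off.
 */
int toy_enter_nohz(struct toy_dyntick *s)
{
    if (s->nesting != 0)
        return 0;
    s->idle = 1;
    return 1;
}
```

In this model an unmatched `toy_irq_enter()` leaves the counter non-zero, so `toy_enter_nohz()` holds off, which mirrors how an unbalanced enter/exit pair can leave a CPU looking busy when it is not.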
Paul E. McKenney	b5904090c7	rcu: Don't do reschedule unless in irq Condition the set_need_resched() in rcu_irq_exit() on in_irq(). This should be a no-op, because rcu_irq_exit() should only be called from irq. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-05-26 09:42:21 -07:00
Paul E. McKenney	1135633bdd	rcu: Remove old memory barriers from rcu_process_callbacks() Second step of partitioning of commit `e59fb3120b`. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-05-26 09:42:21 -07:00
Paul E. McKenney	0bbcc529fc	rcu: Add memory barriers Add the memory barriers added by `e59fb3120b`. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-05-26 09:42:20 -07:00
Ingo Molnar	1102c660dd	Merge branch 'linus' into perf/urgent Merge reason: Linus applied an overlapping commit: `5f2e8e2b0b`: kernel/watchdog.c: Use proper ANSI C prototypes So merge it in to make sure we can iterate the file without conflicts. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-26 13:48:39 +02:00
Yinghai Lu	def945eeb9	irq: Remove smp_affinity_list when unregister irq proc commit 4b06042(bitmap, irq: add smp_affinity_list interface to /proc/irq) causes the following warning: [ 274.239500] WARNING: at fs/proc/generic.c:850 remove_proc_entry+0x24c/0x27a() [ 274.251761] remove_proc_entry: removing non-empty directory 'irq/184', leaking at least 'smp_affinity_list' Remove the new file in the exit path. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Mike Travis <travis@sgi.com> Link: http://lkml.kernel.org/r/4DDDE094.6050505@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-26 13:15:28 +02:00
Steven Rostedt	b1cff0ad10	ftrace: Add internal recursive checks Witold reported a reboot caused by the selftests of the dynamic function tracer. He sent me a config and I used ktest to do a config_bisect on it (as my config did not cause the crash). It pointed out that the problem config was CONFIG_PROVE_RCU. What happened was that if multiple callbacks are attached to the function tracer, we iterate a list of callbacks. Because the list is managed by synchronize_sched() and preempt_disable, the access to the pointers uses rcu_dereference_raw(). When PROVE_RCU is enabled, the rcu_dereference_raw() calls some debugging functions, which happen to be traced. The tracing of the debug function would then call rcu_dereference_raw() which would then call the debug function and then... well you get the idea. I first wrote two different patches to solve this bug. 1) add a __rcu_dereference_raw() that would not do any checks. 2) add notrace to the offending debug functions. Both of these patches worked. Talking with Paul McKenney on IRC, he suggested to add recursion detection instead. This seemed to be a better solution, so I decided to implement it. As the task_struct already has a trace_recursion to detect recursion in the ring buffer, and that has a very small number it allows, I decided to use that same variable to add flags that can detect the recursion inside the infrastructure of the function tracer. I plan to change it so that the task struct bit can be checked in mcount, but as that requires changes to all archs, I will hold that off to the next merge window. Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1306348063.1465.116.camel@gandalf.stny.rr.com Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-25 22:13:49 -04:00
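The recursion detection described in the entry above can be sketched in user space as a flag that is set while the callback list is being walked. All names here (`TOY_IN_TRACER_BIT`, `toy_recursion_flags`, the `toy_*` functions) are hypothetical; the global variable merely stands in for a per-task field, and the real kernel change may differ in detail.

```c
/* Hypothetical flag bit marking "already inside the tracer". */
#define TOY_IN_TRACER_BIT 0x1UL

unsigned long toy_recursion_flags; /* stands in for a per-task field */
int toy_callback_runs;             /* how many times the callback ran */

void toy_list_func(void);

/* A "debug helper" that is itself traced, so it re-enters the tracer. */
void toy_debug_helper(void)
{
    toy_list_func();
}

/* A tracer callback that happens to call the traced debug helper. */
void toy_callback(void)
{
    toy_callback_runs++;
    toy_debug_helper();
}

/* Entry point of the tracer: bail out if we are already inside it. */
void toy_list_func(void)
{
    if (toy_recursion_flags & TOY_IN_TRACER_BIT)
        return; /* recursion detected: do not trace the tracer */

    toy_recursion_flags |= TOY_IN_TRACER_BIT;
    toy_callback();
    toy_recursion_flags &= ~TOY_IN_TRACER_BIT;
}
```

Without the flag check, `toy_list_func()` and `toy_debug_helper()` would call each other until the stack overflowed; with it, the callback runs once per outermost entry.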
liubo	2fc1b6f0d0	tracing: Add __print_symbolic_u64 to avoid warnings on 32bit machine Filesystems such as Btrfs have some "ULL" macros, and when these macros are passed to a tracepoint's __print_symbolic(), there are 64->32 truncation warnings when compiling on a 32-bit box. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Link: http://lkml.kernel.org/r/4DACE6E0.7000507@cn.fujitsu.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-25 22:13:44 -04:00
Steven Rostedt	3b6cfdb171	ftrace: Set ops->flag to enabled even on static function tracing When dynamic ftrace is not configured, the ops->flags still needs to have its FTRACE_OPS_FL_ENABLED bit set in ftrace_startup(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-25 22:13:42 -04:00
Steven Rostedt	17bb615ad4	tracing: Have event with function tracer check error return The self tests for the event tracer do not check whether function tracing was successfully activated. They need to do so before continuing the tests, otherwise the wrong errors may be reported. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-25 22:13:39 -04:00
Steven Rostedt	a1cd617359	ftrace: Have ftrace_startup() return failure code The register_ftrace_function() returns an error code on failure except if the call to ftrace_startup() fails. Add an error return to ftrace_startup() if it fails to start, allowing register_ftrace_function() to return a proper error value. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-25 22:13:37 -04:00
Linus Torvalds	14d74e0cab	Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd: net: fix get_net_ns_by_fd for !CONFIG_NET_NS ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry. ns: Declare sys_setns in syscalls.h net: Allow setting the network namespace by fd ns proc: Add support for the ipc namespace ns proc: Add support for the uts namespace ns proc: Add support for the network namespace. ns: Introduce the setns syscall ns: proc files for namespace naming policy.	2011-05-25 18:10:16 -07:00
Jiri Olsa	7cbc5b8d4a	jump_label: Check entries limit in __jump_label_update When iterating the jump_label entries array (core or modules), the __jump_label_update function peeks over the last entry. The reason is that the end of the for loop depends on the key value of the processed entry. Thus when going through the last array entry, we will touch the memory behind the array limit. This bug probably will never be triggered, since most likely the memory behind the jump_label entries will be accessible and the entry->key will be different from the expected value. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Acked-by: Jason Baron <jbaron@redhat.com> Link: http://lkml.kernel.org/r/20110510104346.GC1899@jolsa.brq.redhat.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-25 19:56:36 -04:00
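The kind of bounds check this entry describes can be sketched as a loop that tests the array limit before it reads the entry's key. This is a minimal illustration under stated assumptions: `struct toy_entry` and `toy_update_matching()` are hypothetical names, and the actual kernel loop may be written differently.

```c
#include <stddef.h>

/* Hypothetical stand-in for a jump label entry. */
struct toy_entry {
    unsigned long key;
    int patched;
};

/*
 * Mark every consecutive entry whose key matches, starting at `entry`
 * and never reading at or beyond `stop` (one past the last element).
 * Returns the number of entries updated.
 */
size_t toy_update_matching(struct toy_entry *entry, struct toy_entry *stop,
                           unsigned long key)
{
    size_t updated = 0;

    /*
     * The limit is tested first; because && short-circuits,
     * entry->key is only read while entry is still inside the array.
     */
    for (; entry < stop && entry->key == key; entry++) {
        entry->patched = 1;
        updated++;
    }
    return updated;
}
```

If the loop condition tested only `entry->key == key`, a run of matching entries that ends exactly at the end of the array would cause one read past the last element.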
Linus Torvalds	9720d75399	Merge branch 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc * 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: signal: sys_pause() should check signal_pending() ptrace: ptrace_resume() shouldn't wake up !TASK_TRACED thread	2011-05-25 16:53:14 -07:00
Linus Torvalds	0798b1dbfb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: (26 commits) arch/tile: prefer "tilepro" as the name of the 32-bit architecture compat: include aio_abi.h for aio_context_t arch/tile: cleanups for tilegx compat mode arch/tile: allocate PCI IRQs later in boot arch/tile: support signal "exception-trace" hook arch/tile: use better definitions of xchg() and cmpxchg() include/linux/compat.h: coding-style fixes tile: add an RTC driver for the Tilera hypervisor arch/tile: finish enabling support for TILE-Gx 64-bit chip compat: fixes to allow working with tile arch arch/tile: update defconfig file to something more useful tile: do_hardwall_trap: do not play with task->sighand tile: replace mm->cpu_vm_mask with mm_cpumask() tile,mn10300: add device parameter to dma_cache_sync() audit: support the "standard" <asm-generic/unistd.h> arch/tile: clarify flush_buffer()/finv_buffer() function names arch/tile: kernel-related cleanups from removing static page size arch/tile: various header improvements for building drivers arch/tile: disable GX prefetcher during cache flush arch/tile: tolerate disabling CONFIG_BLK_DEV_INITRD ...	2011-05-25 15:35:32 -07:00
Thomas Gleixner	90ff1f30c0	hrtimers: Fix typo causing erratic timers commit `9ec2690758` ("timerfd: Manage cancelable timers in timerfd") introduced a CONFIG_HIGHRES_TIMERS (should be CONFIG_HIGH_RES_TIMERS) typo, which caused applications depending on CLOCK_REALTIME timers to become sluggish because the time base of the realtime timers was not updated when the wall clock time was set. This causes anything from 100% CPU use for some applications to odd delays and hiccups. Reported-bisected-and-tested-by: Anca Emanuel <anca.emanuel@gmail.com> Tested-by: Linus Torvalds <torvalds@linux-foundation.org> Fatfingered-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-25 15:31:58 -07:00
Oleg Nesterov	d92fcf0552	signal: sys_pause() should check signal_pending() ERESTART* is always wrong without TIF_SIGPENDING. Teach sys_pause() to handle the spurious wakeup correctly. Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-05-25 19:22:27 +02:00
Oleg Nesterov	0666fb51b1	ptrace: ptrace_resume() shouldn't wake up !TASK_TRACED thread It is not clear why ptrace_resume() does wake_up_process(). Unless the caller is PTRACE_KILL the tracee should be TASK_TRACED so we can use wake_up_state(__TASK_TRACED). If sys_ptrace() races with SIGKILL we do not need the extra and potentially spurious wakeup. If the caller is PTRACE_KILL, wake_up_process() is even more wrong. The tracee can sleep in any state in any place, and if we have buggy code which doesn't handle a spurious wakeup correctly PTRACE_KILL can be used to exploit it. For example: int main(void) { int child, status; child = fork(); if (!child) { int ret; assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0); ret = pause(); printf("pause: %d %m\n", ret); return 0x23; } sleep(1); assert(ptrace(PTRACE_KILL, child, 0,0) == 0); assert(child == wait(&status)); printf("wait: %x\n", status); return 0; } prints "pause: -1 Unknown error 514", -ERESTARTNOHAND leaks to the userland. In this case sys_pause() is buggy as well and should be fixed. I do not know what the original rationale behind PTRACE_KILL was. The man page is simply wrong and afaics it was always wrong. Imho it should be deprecated, or maybe it should do send_sig(SIGKILL) as Denys suggests, but in any case I do not think that the current behaviour was intentional. Note: there is another problem, ptrace_resume() changes ->exit_code and this can race with SIGKILL too. Eventually we should change ptrace to not use ->exit_code. Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-05-25 19:20:21 +02:00
Linus Torvalds	19426a8f81	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix-timers: RCU conversion	2011-05-25 08:58:50 -07:00
Mike Travis	162a7e7500	printk: allocate kernel log buffer earlier On larger systems, because of the numerous ACPI, Bootmem and EFI messages, the static log buffer overflows before the larger one specified by the log_buf_len param is allocated. Minimize the overflow by allocating the new log buffer as soon as possible. On kernels without memblock, a later call to setup_log_buf from kernel/init.c is the fallback. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix CONFIG_PRINTK=n build] Signed-off-by: Mike Travis <travis@sgi.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-25 08:39:48 -07:00
Mike Travis	4b060420a5	bitmap, irq: add smp_affinity_list interface to /proc/irq Manually adjusting the smp_affinity for IRQ's becomes unwieldy when the cpu count is large. Setting smp affinity to cpus 256 to 263 would be: echo 000000ff,00000000,00000000,00000000,00000000,00000000,00000000,00000000 > smp_affinity instead of: echo 256-263 > smp_affinity_list Think about what it looks like for cpus around say, 4088 to 4095. We already have many alternate "list" interfaces: /sys/devices/system/cpu/cpuX/indexY/shared_cpu_list /sys/devices/system/cpu/cpuX/topology/thread_siblings_list /sys/devices/system/cpu/cpuX/topology/core_siblings_list /sys/devices/system/node/nodeX/cpulist /sys/devices/pci*/*/local_cpulist Add a companion interface, smp_affinity_list to use cpu lists instead of cpu maps. This conforms to other companion interfaces where both a map and a list interface exists. This required adding a bitmap_parselist_user() function in a manner similar to the bitmap_parse_user() function. [akpm@linux-foundation.org: make __bitmap_parselist() static] Signed-off-by: Mike Travis <travis@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jack Steiner <steiner@sgi.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-25 08:39:45 -07:00
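The difference between the map and list formats in the entry above can be made concrete with a small helper that expands a CPU range into the comma-separated hex map form. This is a user-space sketch, not the kernel's parser: `toy_cpu_range_to_mask()` is a hypothetical name, and the layout assumed here is 32-bit words printed most significant word first.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Render CPUs lo..hi (inclusive) as comma-separated 32-bit hex words,
 * most significant word first. Returns 0 on success, or -1 if the range
 * is invalid or the buffer is too small.
 */
int toy_cpu_range_to_mask(unsigned lo, unsigned hi, char *buf, size_t len)
{
    unsigned nwords, w, bit;
    char *p = buf;

    if (lo > hi)
        return -1;

    nwords = hi / 32 + 1;
    /* Each word needs 8 hex digits plus a comma or the final NUL. */
    if (len < (size_t)nwords * 9)
        return -1;

    for (w = nwords; w-- > 0; ) {
        uint32_t word = 0;

        for (bit = 0; bit < 32; bit++) {
            unsigned cpu = w * 32 + bit;

            if (cpu >= lo && cpu <= hi)
                word |= (uint32_t)1 << bit;
        }
        p += sprintf(p, "%08x", (unsigned)word);
        if (w > 0)
            *p++ = ',';
    }
    *p = '\0';
    return 0;
}
```

For CPUs 32 to 35 the helper writes `0000000f,00000000`, whereas the list form is simply `32-35`; the gap in readability widens with every additional 32 CPUs.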
KOSAKI Motohiro	de03c72cfc	mm: convert mm->cpu_vm_cpumask into cpumask_var_t cpumask_t is a very big struct and cpu_vm_mask is placed in the wrong position, which might reduce the cache hit ratio. This patch makes two changes. 1) Move the cpumask to the end of mm_struct, because usually only the front bits of the cpumask are accessed when the system has cpu-hotplug capability. 2) Convert cpu_vm_mask into cpumask_var_t. This may help to reduce the memory footprint if cpumask_size() uses nr_cpumask_bits properly in the future. In addition, this patch renames cpu_vm_mask to cpu_vm_mask_var, which may help to detect out-of-tree cpu_vm_mask users. This patch has no functional change. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Howells <dhowells@redhat.com> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-25 08:39:21 -07:00
Peter Zijlstra	3d48ae45e7	mm: Convert i_mmap_lock to a mutex Straightforward conversion of i_mmap_lock to a mutex. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-25 08:39:18 -07:00
Peter Zijlstra	97a894136f	mm: Remove i_mmap_lock lockbreak Hugh says: "The only significant loser, I think, would be page reclaim (when concurrent with truncation): could spin for a long time waiting for the i_mmap_mutex it expects would soon be dropped? " Counter points: - cpu contention makes the spin stop (need_resched()) - zap pages should be freeing pages at a higher rate than reclaim ever can I think the simplification of the truncate code is definitely worth it. Effectively reverts: `2aa15890f3` ("mm: prevent concurrent unmap_mapping_range() on the same inode") and takes out the code that caused its problem. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-25 08:39:17 -07:00
Peter Zijlstra	e4c70a6629	lockdep, mutex: provide mutex_lock_nest_lock In order to convert i_mmap_lock to a mutex we need a mutex equivalent to spin_lock_nest_lock(), thus provide the mutex_lock_nest_lock() annotation. As with spin_lock_nest_lock(), mutex_lock_nest_lock() allows annotation of the locking pattern where an outer lock serializes the acquisition order of nested locks. That is, if every time you lock multiple locks A, say A1 and A2 you first acquire N, the order of acquiring A1 and A2 is irrelevant. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-25 08:39:17 -07:00
Rafael J. Wysocki	f42a9813fb	PM / Hibernate: Update kerneldoc comments in hibernate.c Some of the kerneldoc comments in kernel/power/hibernate.c are outdated and some of them don't adhere to the kernel's standards. Update them and make them look in a consistent way. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Randy Dunlap <randy.dunlap@oracle.com>	2011-05-24 23:36:06 +02:00
Rafael J. Wysocki	354258011e	PM / Hibernate: Remove arch_prepare_suspend() All architectures supporting hibernation define arch_prepare_suspend() as an empty function, so remove it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-05-24 23:35:55 +02:00
Linus Torvalds	b0ca118dba	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (43 commits) TOMOYO: Fix wrong domainname validation. SELINUX: add /sys/fs/selinux mount point to put selinuxfs CRED: Fix load_flat_shared_library() to initialise bprm correctly SELinux: introduce path_has_perm flex_array: allow 0 length elements flex_arrays: allow zero length flex arrays flex_array: flex_array_prealloc takes a number of elements, not an end SELinux: pass last path component in may_create SELinux: put name based create rules in a hashtable SELinux: generic hashtab entry counter SELinux: calculate and print hashtab stats with a generic function SELinux: skip filename trans rules if ttype does not match parent dir SELinux: rename filename_compute_type argument to type instead of con SELinux: fix comment to state filename_compute_type takes an objname not a qstr SMACK: smack_file_lock can use the struct path LSM: separate LSM_AUDIT_DATA_DENTRY from LSM_AUDIT_DATA_PATH LSM: split LSM_AUDIT_DATA_FS into _PATH and _INODE SELINUX: Make selinux cache VFS RCU walks safe SECURITY: Move exec_permission RCU checks into security modules SELinux: security_read_policy should take a size_t not ssize_t ...	2011-05-24 13:38:19 -07:00
Linus Torvalds	5129df03d0	Merge branch 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: Unify input section names percpu: Avoid extra NOP in percpu_cmpxchg16b_double percpu: Cast away printk format warning percpu: Always align percpu output section to PAGE_SIZE Fix up fairly trivial conflict in arch/x86/include/asm/percpu.h as per Tejun	2011-05-24 11:53:42 -07:00
James Morris	434d42cfd0	Merge branch 'next' into for-linus	2011-05-24 22:55:24 +10:00
Eric Dumazet	8af088710d	posix-timers: RCU conversion Ben Nagy reported a scalability problem with KVM/QEMU that hit very hard a single spinlock (idr_lock) in posix-timers code, on its 48 core machine. Even on a 16 cpu machine (2x4x2), a single test can show 98% of cpu time used in ticket_spin_lock, from lock_timer Ref: http://www.spinics.net/lists/kvm/msg51526.html Switching to RCU is quite easy, IDR being already RCU ready. idr_lock should be locked only for an insert/delete, not a lookup. Benchmark on a 2x4x2 machine, 16 processes calling timer_gettime(). Before : real 1m18.669s user 0m1.346s sys 1m17.180s After : real 0m3.296s user 0m1.366s sys 0m1.926s Reported-by: Ben Nagy <ben@iagu.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Ben Nagy <ben@iagu.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Avi Kivity <avi@redhat.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Richard Cochran <richard.cochran@omicron.at> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-24 12:10:51 +02:00
Tejun Heo	6988f20fe0	Merge branch 'fixes-2.6.39' into for-2.6.40	2011-05-24 09:59:36 +02:00
Linus Torvalds	5214638384	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tools: Fix sample type size calculation in 32 bits archs profile: Use vzalloc() rather than vmalloc() & memset()	2011-05-23 21:20:48 -07:00
Linus Torvalds	5f2e8e2b0b	kernel/watchdog.c: Use proper ANSI C prototypes We try to enforce it by using -Wstrict-prototypes, but apparently they sometimes get through. Introduced by `4eec42f392` ("watchdog: Change the default timeout and configure nmi watchdog period based"). Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-23 21:07:40 -07:00
Ingo Molnar	6e9101aeec	watchdog: Fix non-standard prototype of get_softlockup_thresh() This build warning slipped through: kernel/watchdog.c:102: warning: function declaration isn't a prototype As reported by Stephen Rothwell. Also address an unused variable warning that GCC 4.6.0 reports: we cannot do anything about failed watchdog ops during CPU hotplug (it's not serious enough to return an error from the notifier), so ignore them. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20110524134129.8da27016.sfr@canb.auug.org.au Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <20110517071642.GF22305@elte.hu>	2011-05-24 05:53:39 +02:00
Rafael J. Wysocki	4e2d9491a7	PM / Hibernate: Update some comments in core hibernate code Some comments in the core hibernate code are outdated, some aren't necessary any more and at least one of them is plain wrong. Remove those comments or update them. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-05-24 00:21:26 +02:00
Linus Torvalds	15a3d11b0f	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Increase SCHED_LOAD_SCALE resolution sched: Introduce SCHED_POWER_SCALE to scale cpu_power calculations sched: Cleanup set_load_weight()	2011-05-23 12:53:48 -07:00
Linus Torvalds	1f3a8e093f	Merge branch 'staging-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6 * 'staging-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (970 commits) staging: usbip: replace usbip_u{dbg,err,info} and printk with dev_ and pr_ staging:iio: Trivial kconfig reorganization and uniformity improvements. staging:iio:documenation partial update. staging:iio: use pollfunc allocation helpers in remaining drivers. staging:iio:max1363 misc cleanups and use of for_each_bit_set to simplify event code spitting out. staging:iio: implement an iio_info structure to take some of the constant elements out of iio_dev. staging:iio:meter:ade7758: Use private data space from iio_allocate_device staging:iio:accel:lis3l02dq make write_reg_8 take value not a pointer to value. staging:iio: ring core cleanups + check if read_last available in lis3l02dq staging:iio:core cleanup: squash tiny wrappers and use dev_set_name to handle creation of event interface name. staging:iio: poll func allocation clean up. staging:iio:ad7780 trivial unused header cleanup. staging:iio:adc: AD7780: Use private data space from iio_allocate_device + trivial fixes staging:iio:adc:AD7780: Convert to new channel registration method staging:iio:adc: AD7606: Drop dev_data in favour of iio_priv() staging:iio:adc: AD7606: Consitently use indio_dev staging:iio: Rip out helper for software rings. staging:iio:adc:AD7298: Use private data space from iio_allocate_device staging:iio: rationalization of different buffer implementation hooks. staging:iio:imu:adis16400 avoid allocating rx, tx, and state separately from iio_dev. ... Fix up trivial conflicts in - drivers/staging/intel_sst/intelmid.c: patches applied in both branches - drivers/staging/rt2860/common/cmm_data_{pci,usb}.c: removed vs spelling - drivers/staging/usbip/vhci_sysfs.c: trivial header file inclusion	2011-05-23 12:49:28 -07:00
Linus Torvalds	30cb6d5f2e	Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimers: Reorder clock bases hrtimers: Avoid touching inactive timer bases hrtimers: Make struct hrtimer_cpu_base layout less stupid timerfd: Manage cancelable timers in timerfd clockevents: Move C3 stop test outside lock alarmtimer: Drop device refcount after rtc_open() alarmtimer: Check return value of class_find_device() timerfd: Allow timers to be cancelled when clock was set hrtimers: Prepare for cancel on clock was set timers	2011-05-23 11:30:28 -07:00
Linus Torvalds	19504828b4	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tools: Fix sample size bit operations perf tools: Fix ommitted mmap data update on remap watchdog: Change the default timeout and configure nmi watchdog period based on watchdog_thresh watchdog: Disable watchdog when thresh is zero watchdog: Only disable/enable watchdog if neccessary watchdog: Fix rounding bug in get_sample_period() perf tools: Propagate event parse error handling perf tools: Robustify dynamic sample content fetch perf tools: Pre-check sample size before parsing perf tools: Move evlist sample helpers to evlist area perf tools: Remove junk code in mmap size handling perf tools: Check we are able to read the event size on mmap	2011-05-23 09:25:52 -07:00
Linus Torvalds	57d19e80f4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) b43: fix comment typo reqest -> request Haavard Skinnemoen has left Atmel cris: typo in mach-fs Makefile Kconfig: fix copy/paste-ism for dell-wmi-aio driver doc: timers-howto: fix a typo ("unsgined") perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course'). treewide: fix a few typos in comments regulator: change debug statement be consistent with the style of the rest Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations" audit: acquire creds selectively to reduce atomic op overhead rtlwifi: don't touch with treewide double semicolon removal treewide: cleanup continuations and remove logging message whitespace ath9k_hw: don't touch with treewide double semicolon removal include/linux/leds-regulator.h: fix syntax in example code tty: fix typo in descripton of tty_termios_encode_baud_rate xtensa: remove obsolete BKL kernel option from defconfig m68k: fix comment typo 'occcured' arch:Kconfig.locks Remove unused config option. treewide: remove extra semicolons ...	2011-05-23 09:12:26 -07:00
Ingo Molnar	8ce2616955	Merge commit '559fa6e76b27' into perf/urgent Merge reason: this commit was queued up quite some time ago but was forgotten about. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-23 16:15:58 +02:00
Thomas Gleixner	68fa61c026	hrtimers: Reorder clock bases The ordering of the clock bases is historical due to the CLOCK_REALTIME and CLOCK_MONOTONIC constants. Now the hrtimer bases have their own enumeration due to the gap between CLOCK_MONOTONIC and CLOCK_BOOTTIME. So we can be more clever as most timers end up on the CLOCK_MONOTONIC base due to the virtue of POSIX declaring that relative CLOCK_REALTIME timers are not affected by time changes. In desktop environments this is slowly changing as applications switch to absolute timers, but I've observed empty CLOCK_REALTIME bases often enough. There is no performance penalty or overhead when CLOCK_REALTIME timers are active, but in case they are not we don't skip over a full cache line. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Peter Zijlstra <peterz@infradead.org>	2011-05-23 13:59:54 +02:00
Thomas Gleixner	ab8177bc53	hrtimers: Avoid touching inactive timer bases Instead of iterating over all possible timer bases avoid it by marking the active bases in the cpu base. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Peter Zijlstra <peterz@infradead.org>	2011-05-23 13:59:54 +02:00
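The "mark the active bases" idea in ab8177bc53 can be pictured with a small bitmask sketch. This is only an illustration under stated assumptions: the struct, field, and function names below are hypothetical and are not taken from the kernel's hrtimer code.

```c
#include <stdint.h>

enum { SKETCH_NUM_BASES = 3 };	/* assumed number of clock bases */

/* Hypothetical stand-in for a per-CPU timer base. */
struct cpu_base_sketch {
	uint32_t active_bases;		/* bit n set => base n has queued timers */
	int queued[SKETCH_NUM_BASES];	/* number of timers queued on each base */
};

/* Queue a timer on base idx and mark that base active. */
void sketch_enqueue(struct cpu_base_sketch *cb, int idx)
{
	cb->queued[idx]++;
	cb->active_bases |= 1u << idx;
}

/* Remove a timer from base idx; clear the bit when the base becomes empty. */
void sketch_dequeue(struct cpu_base_sketch *cb, int idx)
{
	if (cb->queued[idx] > 0 && --cb->queued[idx] == 0)
		cb->active_bases &= ~(1u << idx);
}

/* Visit only bases whose bit is set; returns how many were visited. */
int sketch_visit_active(const struct cpu_base_sketch *cb)
{
	int visited = 0;

	/* mask &= mask - 1 clears the lowest set bit on each pass. */
	for (uint32_t mask = cb->active_bases; mask; mask &= mask - 1)
		visited++;
	return visited;
}
```

With this layout, a walk over the bases costs one loop iteration per active base rather than one per possible base.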
Thomas Gleixner	9ec2690758	timerfd: Manage cancelable timers in timerfd Peter is concerned about the extra scan of CLOCK_REALTIME_COS in the timer interrupt. Yes, I did not think about it, because the solution was so elegant. I didn't like the extra list in timerfd when it was proposed some time ago, but with an RCU-based list the list walk is less horrible than the original global lock, which was held over the list iteration. Requested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Peter Zijlstra <peterz@infradead.org>	2011-05-23 13:59:53 +02:00
Mandeep Singh Baines	4eec42f392	watchdog: Change the default timeout and configure nmi watchdog period based on watchdog_thresh Before the conversion of the NMI watchdog to perf event, the watchdog timeout was 5 seconds. Now it is 60 seconds. For my particular application, netbooks, 5 seconds was a better timeout. With a short timeout, we catch faults earlier and are able to send back a panic. With a 60 second timeout, the user is unlikely to wait and will instead hit the power button, causing us to lose the panic info. This change configures the NMI period to watchdog_thresh and sets the softlockup_thresh to watchdog_thresh * 2. In addition, watchdog_thresh was reduced to 10 seconds as suggested by Ingo Molnar. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1306127423-3347-4-git-send-email-msb@chromium.org Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <20110517071642.GF22305@elte.hu>	2011-05-23 11:58:59 +02:00
Mandeep Singh Baines	586692a5a5	watchdog: Disable watchdog when thresh is zero This restores the previous behavior of softlock_thresh. Currently, setting watchdog_thresh to zero causes the watchdog kthreads to consume a lot of CPU. In addition, the logic of proc_dowatchdog_thresh and proc_dowatchdog_enabled has been factored into proc_dowatchdog. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1306127423-3347-3-git-send-email-msb@chromium.org Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <20110517071018.GE22305@elte.hu>	2011-05-23 11:58:59 +02:00
Mandeep Singh Baines	e04ab2bc41	watchdog: Only disable/enable watchdog if neccessary Don't take any action on an unsuccessful write to /proc. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1306127423-3347-2-git-send-email-msb@chromium.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-23 11:58:58 +02:00
Mandeep Singh Baines	824c6b7f62	watchdog: Fix rounding bug in get_sample_period() In get_sample_period(), softlockup_thresh is integer divided by 5 before the multiplication by NSEC_PER_SEC. This results in softlockup_thresh being rounded down to the nearest integer multiple of 5. For example, a softlockup_thresh of 4 rounds down to 0. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1306127423-3347-1-git-send-email-msb@chromium.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-23 11:58:58 +02:00
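The rounding problem described in 824c6b7f62 comes down to the order of an integer division and a multiplication. The following is a minimal sketch of that arithmetic, not the kernel function itself; the function names are hypothetical, and NSEC_PER_SEC is redefined locally so the snippet stands alone.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Divide first: the integer division truncates before scaling to ns. */
uint64_t sketch_period_divide_first(unsigned int thresh_sec)
{
	return (thresh_sec / 5) * NSEC_PER_SEC;
}

/* Scale first: dividing the ns constant keeps the sub-5-second part. */
uint64_t sketch_period_scale_first(unsigned int thresh_sec)
{
	return thresh_sec * (NSEC_PER_SEC / 5);
}
```

For a threshold of 4 seconds, the divide-first form yields 0 ns, which matches the example in the commit message, while the scale-first form yields 800,000,000 ns.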
Linus Torvalds	e98bae7592	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6: (28 commits) sparc32: fix build, fix missing cpu_relax declaration SCHED_TTWU_QUEUE is not longer needed since sparc32 now implements IPI sparc32,leon: Remove unnecessary page_address calls in LEON DMA API. sparc: convert old cpumask API into new one sparc32, sun4d: Implemented SMP IPIs support for SUN4D machines sparc32, sun4m: Implemented SMP IPIs support for SUN4M machines sparc32,leon: Implemented SMP IPIs for LEON CPU sparc32: implement SMP IPIs using the generic functions sparc32,leon: SMP power down implementation sparc32,leon: added some SMP comments sparc: add {read,write}*_be routines sparc32,leon: don't rely on bootloader to mask IRQs sparc32,leon: operate on boot-cpu IRQ controller registers sparc32: always define boot_cpu_id sparc32: removed unused code, implemented by generic code sparc32: avoid build warning at mm/percpu.c:1647 sparc32: always register a PROM based early console sparc32: probe for cpu info only during startup sparc: consolidate show_cpuinfo in cpu.c sparc32,leon: implement genirq CPU affinity ...	2011-05-22 22:06:24 -07:00
Kevin Cernekee	be84bfcc3e	ipc: Add missing sys_ni entries for ipc/compat.c functions When building with: CONFIG_64BIT=y CONFIG_MIPS32_COMPAT=y CONFIG_COMPAT=y CONFIG_MIPS32_O32=y CONFIG_MIPS32_N32=y CONFIG_SYSVIPC is not set (and implicitly: CONFIG_SYSVIPC_COMPAT is not set) the final link fails with unresolved symbols for: compat_sys_semctl, compat_sys_msgsnd, compat_sys_msgrcv, compat_sys_shmctl, compat_sys_msgctl, compat_sys_semtimedop The fix is to add cond_syscall declarations for all syscalls in ipc/compat.c Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-20 13:53:02 -07:00
Linus Torvalds	06f4e926d2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits) macvlan: fix panic if lowerdev in a bond tg3: Add braces around 5906 workaround. tg3: Fix NETIF_F_LOOPBACK error macvlan: remove one synchronize_rcu() call networking: NET_CLS_ROUTE4 depends on INET irda: Fix error propagation in ircomm_lmp_connect_response() irda: Kill set but unused variable 'bytes' in irlan_check_command_param() irda: Kill set but unused variable 'clen' in ircomm_connect_indication() rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport() be2net: Kill set but unused variable 'req' in lancer_fw_download() irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication() atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined. rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer(). rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler() rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection() rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window() pkt_sched: Kill set but unused variable 'protocol' in tc_classify() isdn: capi: Use pr_debug() instead of ifdefs. tg3: Update version to 3.119 tg3: Apply rx_discards fix to 5719/5720 ... Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c as per Davem.	2011-05-20 13:43:21 -07:00
Linus Torvalds	102dc1bae1	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: MAINTAINERS: Add drivers/clocksource to TIMEKEEPING clockevents/source: Use u64 to make 32bit happy	2011-05-20 13:38:28 -07:00
Linus Torvalds	bc091c93a0	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: extable, core_kernel_data(): Make sure all archs define _sdata core_kernel_data(): Fix architectures that do not define _sdata	2011-05-20 13:37:22 -07:00
Linus Torvalds	3ed4c0583d	Merge branch 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (41 commits) signal: trivial, fix the "timespec declared inside parameter list" warning job control: reorganize wait_task_stopped() ptrace: fix signal->wait_chldexit usage in task_clear_group_stop_trapping() signal: sys_sigprocmask() needs retarget_shared_pending() signal: cleanup sys_sigprocmask() signal: rename signandsets() to sigandnsets() signal: do_sigtimedwait() needs retarget_shared_pending() signal: introduce do_sigtimedwait() to factor out compat/native code signal: sys_rt_sigtimedwait: simplify the timeout logic signal: cleanup sys_rt_sigprocmask() x86: signal: sys_rt_sigreturn() should use set_current_blocked() x86: signal: handle_signal() should use set_current_blocked() signal: sigprocmask() should do retarget_shared_pending() signal: sigprocmask: narrow the scope of ->siglock signal: retarget_shared_pending: optimize while_each_thread() loop signal: retarget_shared_pending: consider shared/unblocked signals only signal: introduce retarget_shared_pending() ptrace: ptrace_check_attach() should not do s/STOPPED/TRACED/ signal: Turn SIGNAL_STOP_DEQUEUED into GROUP_STOP_DEQUEUED signal: do_signal_stop: Remove the unneeded task_clear_group_stop_pending() ...	2011-05-20 13:33:21 -07:00
Daniel Hellstrom	17d9f311ec	SCHED_TTWU_QUEUE is not longer needed since sparc32 now implements IPI Signed-off-by: Daniel Hellstrom <daniel@gaisler.com> Reported-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-20 13:10:55 -07:00
David S. Miller	90d3ac15e5	Merge commit '317f394160e9beb97d19a84c39b7e5eb3d7815a8' Conflicts: arch/sparc/kernel/smp_32.c With merge conflict help from Daniel Hellstrom. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-20 13:10:22 -07:00
Linus Torvalds	268bb0ce3e	sanitize <linux/prefetch.h> usage Commit `e66eed651f` ("list: remove prefetching from regular list iterators") removed the include of prefetch.h from list.h, which uncovered several cases that had apparently relied on that rather obscure header file dependency. So this fixes things up a bit, using grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw(' -- '.[ch]') grep -L 'prefetchw(' $(git grep -l 'linux/prefetch.h' -- '.[ch]') to guide us in finding files that either need <linux/prefetch.h> inclusion, or have it despite not needing it. There are more of them around (mostly network drivers), but this gets many core ones. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-20 12:50:29 -07:00
Thomas Gleixner	250f972d85	Merge branch 'timers/urgent' into timers/core Reason: Get upstream fixes and kfree_rcu which is necessary for a follow up patch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-20 20:08:05 +02:00
Nikhil Rao	c8b281161d	sched: Increase SCHED_LOAD_SCALE resolution Introduce SCHED_LOAD_RESOLUTION, which is added to SCHED_LOAD_SHIFT and increases the resolution of SCHED_LOAD_SCALE. This patch sets the value of SCHED_LOAD_RESOLUTION to 10, scaling up the weights for all sched entities by a factor of 1024. With this extra resolution, we can handle deeper cgroup hierarchies and the scheduler can do better shares distribution and load balancing on larger systems (especially for low weight task groups). This does not change the existing user interface; the scaled weights are only used internally. We do not modify prio_to_weight values or inverses, but use the original weights when calculating the inverse which is used to scale execution time delta in calc_delta_mine(). This ensures we do not lose accuracy when accounting time to the sched entities. Thanks to Nikunj Dadhania for fixing a bug in c_d_m() that broke fairness. Below is some analysis of the performance costs/improvements of this patch. 1. Micro-arch performance costs: Experiment was to run Ingo's pipe_test_100k 200 times with the task pinned to one cpu. I measured instructions, cycles and stalled-cycles for the runs. See: http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389 for more info. 
-tip (baseline): Performance counter stats for '/root/load-scale/pipe-test-100k' (200 runs): 964,991,769 instructions # 0.82 insns per cycle # 0.33 stalled cycles per insn # ( +- 0.05% ) 1,171,186,635 cycles # 0.000 GHz ( +- 0.08% ) 306,373,664 stalled-cycles-backend # 26.16% backend cycles idle ( +- 0.28% ) 314,933,621 stalled-cycles-frontend # 26.89% frontend cycles idle ( +- 0.34% ) 1.122405684 seconds time elapsed ( +- 0.05% ) -tip+patches: Performance counter stats for './load-scale/pipe-test-100k' (200 runs): 963,624,821 instructions # 0.82 insns per cycle # 0.33 stalled cycles per insn # ( +- 0.04% ) 1,175,215,649 cycles # 0.000 GHz ( +- 0.08% ) 315,321,126 stalled-cycles-backend # 26.83% backend cycles idle ( +- 0.28% ) 316,835,873 stalled-cycles-frontend # 26.96% frontend cycles idle ( +- 0.29% ) 1.122238659 seconds time elapsed ( +- 0.06% ) With this patch, instructions decrease by ~0.10% and cycles increase by 0.27%. This doesn't look statistically significant. The number of stalled cycles in the backend increased from 26.16% to 26.83%. This can be attributed to the shifts we do in c_d_m() and other places. The fraction of stalled cycles in the frontend remains about the same, at 26.96% compared to 26.89% in -tip. 2. Balancing low-weight task groups Test setup: run 50 tasks with random sleep/busy times (biased around 100ms) in a low weight container (with cpu.shares = 2). Measure %idle as reported by mpstat over a 10s window. 
-tip (baseline): 06:47:48 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle intr/s 06:47:49 PM all 94.32 0.00 0.06 0.00 0.00 0.00 0.00 0.00 5.62 15888.00 06:47:50 PM all 94.57 0.00 0.62 0.00 0.00 0.00 0.00 0.00 4.81 16180.00 06:47:51 PM all 94.69 0.00 0.06 0.00 0.00 0.00 0.00 0.00 5.25 15966.00 06:47:52 PM all 95.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.19 16053.00 06:47:53 PM all 94.88 0.06 0.00 0.00 0.00 0.00 0.00 0.00 5.06 15984.00 06:47:54 PM all 93.31 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.69 15806.00 06:47:55 PM all 94.19 0.00 0.06 0.00 0.00 0.00 0.00 0.00 5.75 15896.00 06:47:56 PM all 92.87 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.13 15716.00 06:47:57 PM all 94.88 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.12 15982.00 06:47:58 PM all 95.44 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.56 16075.00 Average: all 94.49 0.01 0.08 0.00 0.00 0.00 0.00 0.00 5.42 15954.60 -tip+patches: 06:47:03 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle intr/s 06:47:04 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16630.00 06:47:05 PM all 99.69 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.31 16580.20 06:47:06 PM all 99.69 0.00 0.06 0.00 0.00 0.00 0.00 0.00 0.25 16596.00 06:47:07 PM all 99.20 0.00 0.74 0.00 0.00 0.06 0.00 0.00 0.00 17838.61 06:47:08 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16540.00 06:47:09 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16575.00 06:47:10 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16614.00 06:47:11 PM all 99.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 16588.00 06:47:12 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 0.00 16593.00 06:47:13 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 0.00 16551.00 Average: all 99.84 0.00 0.09 0.00 0.00 0.01 0.00 0.00 0.06 16711.58 We see an improvement in idle% on the system (drops from 5.42% on -tip to 0.06% with the patches). 
Signed-off-by: Nikhil Rao <ncrao@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de> Cc: Mike Galbraith <efault@gmx.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1305754668-18792-1-git-send-email-ncrao@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-20 14:16:50 +02:00
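Why ten extra bits of resolution help low-weight task groups (c8b281161d) can be seen with plain integer arithmetic. This is a simplified sketch under stated assumptions: the helper names and the even per-CPU split are hypothetical illustrations, not the scheduler's actual shares-distribution code. Only the resolution value of 10 comes from the commit message.

```c
#include <stdint.h>

#define SKETCH_LOAD_RESOLUTION 10	/* value stated in the commit message */

/* Scale a user-visible weight up by 2^10 = 1024 for internal use. */
uint64_t sketch_scale_load(uint64_t w)
{
	return w << SKETCH_LOAD_RESOLUTION;
}

/* Scale an internal weight back down to the user-visible range. */
uint64_t sketch_scale_load_down(uint64_t w)
{
	return w >> SKETCH_LOAD_RESOLUTION;
}

/* Hypothetical even split of a group's weight across nr_cpus CPUs. */
uint64_t sketch_per_cpu_share(uint64_t group_weight, unsigned int nr_cpus)
{
	return group_weight / nr_cpus;
}
```

With cpu.shares = 2 on an assumed 16-CPU machine, the unscaled split truncates to 0 per CPU, whereas the scaled weight (2048) splits into 128 per CPU, so the group still carries a nonzero weight on every CPU.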
Nikhil Rao	1399fa7807	sched: Introduce SCHED_POWER_SCALE to scale cpu_power calculations SCHED_LOAD_SCALE is used to increase nice resolution and to scale cpu_power calculations in the scheduler. This patch introduces SCHED_POWER_SCALE and converts all uses of SCHED_LOAD_SCALE for scaling cpu_power to use SCHED_POWER_SCALE instead. This is a preparatory patch for increasing the resolution of SCHED_LOAD_SCALE, and there is no need to increase resolution for cpu_power calculations. Signed-off-by: Nikhil Rao <ncrao@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de> Cc: Mike Galbraith <efault@gmx.de> Link: http://lkml.kernel.org/r/1305738580-9924-3-git-send-email-ncrao@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-20 14:16:50 +02:00
Nikhil Rao	f05998d4b8	sched: Cleanup set_load_weight() Avoid using long repetitious names; make this simpler and nicer to read. No functional change introduced in this patch. Signed-off-by: Nikhil Rao <ncrao@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de> Cc: Mike Galbraith <efault@gmx.de> Link: http://lkml.kernel.org/r/1305738580-9924-2-git-send-email-ncrao@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-20 14:16:49 +02:00
Tejun Heo	9c5a2ba702	workqueue: separate out drain_workqueue() from destroy_workqueue() There are users which want to drain workqueues without destroying them. Separate out drain functionality from destroy_workqueue() into drain_workqueue() and make it accessible to workqueue users. To guarantee forward-progress, only chain queueing is allowed while drain is in progress. If a new work item which isn't chained from the running or pending work items is queued while draining is in progress, WARN_ON_ONCE() is triggered. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: James Bottomley <James.Bottomley@hansenpartnership.com>	2011-05-20 13:54:46 +02:00
Thomas Gleixner	c0e299b1a9	clockevents/source: Use u64 to make 32bit happy unsigned long is not 64 bits wide on a 32-bit machine. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-20 10:50:52 +02:00
Steven Rostedt	a2d063ac21	extable, core_kernel_data(): Make sure all archs define _sdata A new utility function (core_kernel_data()) is used to determine if a passed in address is part of core kernel data or not. It may or may not return true for RO data, but this utility must work for RW data. Thus both _sdata and _edata must be defined and continuous, without .init sections that may later be freed and replaced by volatile memory (memory that can be freed). This utility function is used to determine if data is safe from ever being freed. Thus it should return true for all RW global data that is not in a module or has been allocated, or false otherwise. Also change core_kernel_data() back to the more precise _sdata condition and document the function. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Hirokazu Takata <takata@linux-m32r.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: linux-m68k@lists.linux-m68k.org Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Helge Deller <deller@gmx.de> Cc: JamesE.J.Bottomley <jejb@parisc-linux.org> Link: http://lkml.kernel.org/r/1305855298.1465.19.camel@gandalf.stny.rr.com Signed-off-by: Ingo Molnar <mingo@elte.hu> ---- arch/alpha/kernel/vmlinux.lds.S \| 1 + arch/m32r/kernel/vmlinux.lds.S \| 1 + arch/m68k/kernel/vmlinux-std.lds \| 2 ++ arch/m68k/kernel/vmlinux-sun3.lds \| 1 + arch/mips/kernel/vmlinux.lds.S \| 1 + arch/parisc/kernel/vmlinux.lds.S \| 3 +++ kernel/extable.c \| 12 +++++++++++- 7 files changed, 20 insertions(+), 1 deletion(-)	2011-05-20 08:56:56 +02:00
Ingo Molnar	c16dbd54a3	Merge branch 'perf/core' into perf/urgent Merge reason: One pending commit was left in perf/core after Linus merged perf/core - continue v2.6.40 work in the perf/urgent reason. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-20 08:54:08 +02:00
Chris Metcalf	571d76acda	arch/tile: support signal "exception-trace" hook This change adds support for /proc/sys/debug/exception-trace to tile. Like x86 and sparc, by default it is set to "1", generating a one-line printk whenever a user process crashes. By setting it to "2", we get a much more complete userspace diagnostic at crash time, including a user-space backtrace, register dump, and memory dump around the address of the crash. Some vestiges of the Tilera-internal version of this support are removed with this patch (the show_crashinfo variable and the arch_coredump_signal function). We retain a "crashinfo" boot parameter which allows you to set the boot-time value of exception-trace. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>	2011-05-19 22:55:59 -04:00
Linus Torvalds	39ab05c8e0	Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (44 commits) debugfs: Silence DEBUG_STRICT_USER_COPY_CHECKS=y warning sysfs: remove "last sysfs file:" line from the oops messages drivers/base/memory.c: fix warning due to "memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION" memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION SYSFS: Fix erroneous comments for sysfs_update_group(). driver core: remove the driver-model structures from the documentation driver core: Add the device driver-model structures to kerneldoc Translated Documentation/email-clients.txt RAW driver: Remove call to kobject_put(). reboot: disable usermodehelper to prevent fs access efivars: prevent oops on unload when efi is not enabled Allow setting of number of raw devices as a module parameter Introduce CONFIG_GOOGLE_FIRMWARE driver: Google Memory Console driver: Google EFI SMI x86: Better comments for get_bios_ebda() x86: get_bios_ebda_length() misc: fix ti-st build issues params.c: Use new strtobool function to process boolean inputs debugfs: move to new strtobool ... Fix up trivial conflicts in fs/debugfs/file.c due to the same patch being applied twice, and an unrelated cleanup nearby.	2011-05-19 18:24:11 -07:00
Linus Torvalds	eb04f2f04e	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (78 commits) Revert "rcu: Decrease memory-barrier usage based on semi-formal proof" net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree batman,rcu: convert call_rcu(softif_neigh_free_rcu) to kfree_rcu batman,rcu: convert call_rcu(neigh_node_free_rcu) to kfree() batman,rcu: convert call_rcu(gw_node_free_rcu) to kfree_rcu net,rcu: convert call_rcu(kfree_tid_tx) to kfree_rcu() net,rcu: convert call_rcu(xt_osf_finger_free_rcu) to kfree_rcu() net/mac80211,rcu: convert call_rcu(work_free_rcu) to kfree_rcu() net,rcu: convert call_rcu(wq_free_rcu) to kfree_rcu() net,rcu: convert call_rcu(phonet_device_rcu_free) to kfree_rcu() perf,rcu: convert call_rcu(swevent_hlist_release_rcu) to kfree_rcu() perf,rcu: convert call_rcu(free_ctx) to kfree_rcu() net,rcu: convert call_rcu(__nf_ct_ext_free_rcu) to kfree_rcu() net,rcu: convert call_rcu(net_generic_release) to kfree_rcu() net,rcu: convert call_rcu(netlbl_unlhsh_free_addr6) to kfree_rcu() net,rcu: convert call_rcu(netlbl_unlhsh_free_addr4) to kfree_rcu() security,rcu: convert call_rcu(sel_netif_free) to kfree_rcu() net,rcu: convert call_rcu(xps_dev_maps_release) to kfree_rcu() net,rcu: convert call_rcu(xps_map_release) to kfree_rcu() net,rcu: convert call_rcu(rps_map_release) to kfree_rcu() ...	2011-05-19 18:14:34 -07:00
Linus Torvalds	78c4def67e	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer: Make lookup table const RTC: Disable CONFIG_RTC_CLASS from being built as a module timers: Fix alarmtimer build issues when CONFIG_RTC_CLASS=n timers: Remove delayed irqwork from alarmtimers implementation timers: Improve alarmtimer comments and minor fixes timers: Posix interface for alarm-timers timers: Introduce in-kernel alarm-timer interface timers: Add rb_init_node() to allow for stack allocated rb nodes time: Add timekeeping_inject_sleeptime	2011-05-19 17:45:08 -07:00
Linus Torvalds	7e6628e4bc	Merge branch 'timers-clockevents-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-clockevents-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: hpet: Cleanup the clockevents init and register code x86: Convert PIT to clockevents_config_and_register() clockevents: Provide interface to reconfigure an active clock event device clockevents: Provide combined configure and register function clockevents: Restructure clock_event_device members clocksource: Get rid of the hardcoded 5 seconds sleep time limit clocksource: Restructure clocksource struct members	2011-05-19 17:44:40 -07:00
Linus Torvalds	80fe02b5da	Merge branches 'sched-core-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (60 commits) sched: Fix and optimise calculation of the weight-inverse sched: Avoid going ahead if ->cpus_allowed is not changed sched, rt: Update rq clock when unthrottling of an otherwise idle CPU sched: Remove unused parameters from sched_fork() and wake_up_new_task() sched: Shorten the construction of the span cpu mask of sched domain sched: Wrap the 'cfs_rq->nr_spread_over' field with CONFIG_SCHED_DEBUG sched: Remove unused 'this_best_prio arg' from balance_tasks() sched: Remove noop in alloc_rt_sched_group() sched: Get rid of lock_depth sched: Remove obsolete comment from scheduler_tick() sched: Fix sched_domain iterations vs. RCU sched: Next buddy hint on sleep and preempt path sched: Make set__buddy() work on non-task entities sched: Remove need_migrate_task() sched: Move the second half of ttwu() to the remote cpu sched: Restructure ttwu() some more sched: Rename ttwu_post_activation() to ttwu_do_wakeup() sched: Remove rq argument from ttwu_stat() sched: Remove rq->lock from the first half of ttwu() sched: Drop rq->lock from sched_exec() ... 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix rt_rq runtime leakage bug	2011-05-19 17:41:22 -07:00
Linus Torvalds	df48d8716e	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (107 commits) perf stat: Add more cache-miss percentage printouts perf stat: Add -d -d and -d -d -d options to show more CPU events ftrace/kbuild: Add recordmcount files to force full build ftrace: Add self-tests for multiple function trace users ftrace: Modify ftrace_set_filter/notrace to take ops ftrace: Allow dynamically allocated function tracers ftrace: Implement separate user function filtering ftrace: Free hash with call_rcu_sched() ftrace: Have global_ops store the functions that are to be traced ftrace: Add ops parameter to ftrace_startup/shutdown functions ftrace: Add enabled_functions file ftrace: Use counters to enable functions to trace ftrace: Separate hash allocation and assignment ftrace: Create a global_ops to hold the filter and notrace hashes ftrace: Use hash instead for FTRACE_FL_FILTER ftrace: Replace FTRACE_FL_NOTRACE flag with a hash of ignored functions perf bench, x86: Add alternatives-asm.h wrapper x86, 64-bit: Fix copy_[to/from]_user() checks for the userspace address limit x86, mem: memset_64.S: Optimize memset by enhanced REP MOVSB/STOSB x86, mem: memmove_64.S: Optimize memmove by enhanced REP MOVSB/STOSB ...	2011-05-19 17:36:08 -07:00
Linus Torvalds	acd30250d7	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: irq: Export functions to allow modular irq drivers genirq: Uninline and sanity check generic_handle_irq() genirq: Remove pointless ifdefs genirq: Make generic irq chip depend on CONFIG_GENERIC_IRQ_CHIP genirq: Add chip suspend and resume callbacks genirq: Implement a generic interrupt chip genirq: Support per-IRQ thread disabling. genirq: irq_desc: Document preflow_handler and affinity_hint genirq: Update DocBook comments genirq: Forgotten updates/deletions after removal of compat code	2011-05-19 17:30:15 -07:00
Linus Torvalds	6595b4a940	Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: seqlock: Don't smp_rmb in seqlock reader spin loop watchdog, hung_task_timeout: Add Kconfig configurable default lockdep: Remove cmpxchg to update nr_chain_hlocks lockdep: Print a nicer description for simple irq lock inversions lockdep: Replace "Bad BFS generated tree" message with something less cryptic lockdep: Print a nicer description for irq inversion bugs lockdep: Print a nicer description for simple deadlocks lockdep: Print a nicer description for normal deadlocks lockdep: Print a nicer description for irq lock inversions	2011-05-19 17:29:29 -07:00
Linus Torvalds	51509a283a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: (34 commits) PM: Introduce generic prepare and complete callbacks for subsystems PM: Allow drivers to allocate memory from .prepare() callbacks safely PM: Remove CONFIG_PM_VERBOSE Revert "PM / Hibernate: Reduce autotuned default image size" PM / Hibernate: Add sysfs knob to control size of memory for drivers PM / Wakeup: Remove useless synchronize_rcu() call kmod: always provide usermodehelper_disable() PM / ACPI: Remove acpi_sleep=s4_nonvs PM / Wakeup: Fix build warning related to the "wakeup" sysfs file PM: Print a warning if firmware is requested when tasks are frozen PM / Runtime: Rework runtime PM handling during driver removal Freezer: Use SMP barriers PM / Suspend: Do not ignore error codes returned by suspend_enter() PM: Fix build issue in clock_ops.c for CONFIG_PM_RUNTIME unset PM: Revert "driver core: platform_bus: allow runtime override of dev_pm_ops" OMAP1 / PM: Use generic clock manipulation routines for runtime PM PM: Remove sysdev suspend, resume and shutdown operations PM / PowerPC: Use struct syscore_ops instead of sysdevs for PM PM / UNICORE32: Use struct syscore_ops instead of sysdevs for PM PM / AVR32: Use struct syscore_ops instead of sysdevs for PM ...	2011-05-19 16:46:07 -07:00
Ingo Molnar	c5fc472171	core_kernel_data(): Fix architectures that do not define _sdata Some architectures such as Alpha do not define _sdata but _data: kernel/built-in.o: In function `core_kernel_data': kernel/extable.c:77: undefined reference to `_sdata' So expand the scope of the data range to the text addresses too, this might be more correct anyway because this way we can cover readonly variables as well. Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/n/tip-i878c8a0e0g0ep4v7i6vxnhz@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-20 01:27:16 +02:00
Randy Dunlap	d410fa4ef9	Create Documentation/security/, move LSM-, credentials-, and keys-related files from Documentation/ to Documentation/security/, add Documentation/security/00-INDEX, and update all occurrences of Documentation/<moved_file> to Documentation/security/<moved_file>.	2011-05-19 15:59:38 -07:00
Paul E. McKenney	80d02085d9	Revert "rcu: Decrease memory-barrier usage based on semi-formal proof" This reverts commit `e59fb3120b`. This reversion was due to (extreme) boot-time slowdowns on SPARC seen by Yinghai Lu and on x86 by Ingo. This is a non-trivial reversion due to intervening commits. Conflicts: Documentation/RCU/trace.txt kernel/rcutree.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-19 23:25:29 +02:00
Thomas Gleixner	80b816b736	clockevents: Provide interface to reconfigure an active clock event device Some ARM SoCs have clock event devices which have their frequency modified due to frequency scaling. Provide an interface which allows reconfiguring an active device. After reconfiguration, reprogram the currently pending event. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: LAK <linux-arm-kernel@lists.infradead.org> Cc: John Stultz <john.stultz@linaro.org> Acked-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Ingo Molnar <mingo@elte.hu> Link: http://lkml.kernel.org/r/%3C20110518210136.437459958%40linutronix.de%3E	2011-05-19 14:24:16 +02:00
Thomas Gleixner	57f0fcbe1d	clockevents: Provide combined configure and register function All clockevent devices have the same open coded initialization functions. Provide an interface which does all necessary initialization in the core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Reviewed-by: Ingo Molnar <mingo@elte.hu> Link: http://lkml.kernel.org/r/%3C20110518210136.331975870%40linutronix.de%3E	2011-05-19 14:24:15 +02:00
Thomas Gleixner	724ed53e8a	clocksource: Get rid of the hardcoded 5 seconds sleep time limit Slow clocksources can have a way longer sleep time than 5 seconds and even fast ones can easily cope with 600 seconds and still maintain proper accuracy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Reviewed-by: Ingo Molnar <mingo@elte.hu> Link: http://lkml.kernel.org/r/%3C20110518210136.109811585%40linutronix.de%3E	2011-05-19 14:24:15 +02:00
James Morris	12a5a2621b	Merge branch 'master' into next Conflicts: include/linux/capability.h Manually resolve merge conflict w/ thanks to Stephen Rothwell. Signed-off-by: James Morris <jmorris@namei.org>	2011-05-19 18:51:57 +10:00
Jonathan Cameron	f721a465cd	params.c: Use new strtobool function to process boolean inputs Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-05-19 16:55:28 +09:30
Alessio Igor Bogani	9d63487f86	module: Use binary search in lookup_symbol() The function is_exported() and its helper function lookup_symbol() are used to verify whether a provided symbol is actually exported by the kernel or by the modules. Now that both have their symbols sorted, we can replace the linear search with a binary search, which provides a considerable speed-up. This work was supported by a hardware donation from the CE Linux Forum. Signed-off-by: Alessio Igor Bogani <abogani@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-05-19 16:55:27 +09:30
Alessio Igor Bogani	403ed27846	module: Use the binary search for symbols resolution Takes advantage of the order and locates symbols using binary search. This work was supported by a hardware donation from the CE Linux Forum. Signed-off-by: Alessio Igor Bogani <abogani@kernel.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Tested-by: Dirk Behme <dirk.behme@googlemail.com>	2011-05-19 16:55:27 +09:30
Rusty Russell	de4d8d5346	module: each_symbol_section instead of each_symbol Instead of having a callback function for each symbol in the kernel, have a callback for each array of symbols. This eases the logic when we move to sorted symbols and binary search. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Alessio Igor Bogani <abogani@kernel.org>	2011-05-19 16:55:26 +09:30
Jan Glauber	01526ed083	module: split unset_section_ro_nx function. Split the unprotect function into a function per section to make the code more readable and add the missing static declaration. Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-05-19 16:55:26 +09:30
Jan Glauber	448694a1d5	module: undo module RONX protection correctly. While debugging I stumbled over two problems in the code that protects module pages. The first issue is that disabling the protection before freeing init or unloading a module is not symmetric with the enablement. For instance, if pages are set to RO the page range from module_core to module_core + core_ro_size is protected. If a module is unloaded the page range from module_core to module_core + core_size is set back to RW. So pages that were not set to RO are also changed to RW. This is not critical but IMHO it should be symmetric. The second issue is that while set_memory_rw & set_memory_ro are used for RO/RW changes, only set_memory_nx is involved for NX/X. One would expect the inverse function to be called when the NX protection should be removed, which is not the case here, unless I'm missing something. Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-05-19 16:55:26 +09:30
Jan Glauber	4d10380e72	module: zero mod->init_ro_size after init is freed. Reset mod->init_ro_size to zero after the init part of a module is unloaded. Otherwise we need to check if module->init is NULL in the unprotect functions in the next patch. Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-05-19 16:55:26 +09:30
Daniel J Blueman	5d05c70849	minor ANSI prototype sparse fix Fix function prototype to be ANSI-C compliant, consistent with other function prototypes, addressing a sparse warning. Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-05-19 16:55:25 +09:30
Dmitry Torokhov	b4bc842802	module: deal with alignment issues in built-in module versions On m68k natural alignment is 2-byte boundary but we are trying to align structures in __modver section on sizeof(void *) boundary. This causes trouble when we try to access elements in this section in array-like fashion when creating "version" attributes for built-in modules. Moreover, as DaveM said, we can't reliably put structures into independent objects, put them into a special section, and then expect array access over them (via the section boundaries) after linking the objects together to just "work" due to variable alignment choices in different situations. The only solution that seems to work reliably is to make an array of plain pointers to the objects in question and put those pointers in the special section. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Dmitry Torokhov <dtor@vmware.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-05-19 16:55:24 +09:30
Steven Rostedt	95950c2ecb	ftrace: Add self-tests for multiple function trace users Add some basic sanity tests for multiple users of the function tracer at startup. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 19:24:51 -04:00
Steven Rostedt	936e074b28	ftrace: Modify ftrace_set_filter/notrace to take ops Since users of the function tracer can now pick and choose which functions they want to trace agnostically from other users of the function tracer, we need to pass the ops struct to the ftrace_set_filter() functions. The functions ftrace_set_global_filter() and ftrace_set_global_notrace() are added to keep the old filter functions which are used to modify the generic function tracers. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 19:22:52 -04:00
Steven Rostedt	cdbe61bfe7	ftrace: Allow dynamically allocated function tracers Now that functions may be selected individually, it only makes sense that we should allow dynamically allocated trace structures to be traced. This will allow perf to allocate a ftrace_ops structure at runtime and use it to pick and choose which functions that structure will trace. Note, a dynamically allocated ftrace_ops will always be called indirectly instead of being called directly from the mcount in entry.S. This is because there's no safe way to prevent mcount from being preempted before calling the function, unless we modify every entry.S to do so (not likely). Thus, dynamically allocated functions will now be called by the ftrace_ops_list_func() that loops through the ops that are allocated if there is more than one op allocated at a time. This loop is protected with a preempt_disable. To determine if an ftrace_ops structure is allocated or not, a new util function was added to the kernel/extable.c called core_kernel_data(), which returns 1 if the address is between _sdata and _edata. Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 15:29:51 -04:00
Steven Rostedt	b848914ce3	ftrace: Implement separate user function filtering ftrace_ops that are registered to trace functions can now be agnostic to each other with respect to what functions they trace. Each ops has its own hash of the functions it wants to trace and a hash of the functions it does not want to trace. An empty hash for the functions to trace denotes that all functions not in the notrace hash should be traced. Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 15:29:50 -04:00
Steven Rostedt	07fd5515f3	ftrace: Free hash with call_rcu_sched() When a hash is modified and might be in use, we need to perform a schedule RCU operation on it, as the hashes will soon be used directly in the function tracer callback. Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 15:29:50 -04:00
Steven Rostedt	2b499381bc	ftrace: Have global_ops store the functions that are to be traced This is a step towards each ops structure defining its own set of functions to trace. As the current code with pid's and such are specific to the global_ops, it is restructured to be used with the global ops. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 15:29:49 -04:00
Steven Rostedt	bd69c30b1d	ftrace: Add ops parameter to ftrace_startup/shutdown functions In order to allow different ops to enable different functions, the ftrace_startup() and ftrace_shutdown() functions need the ops parameter passed to them. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 15:29:48 -04:00
Steven Rostedt	647bcd03d5	ftrace: Add enabled_functions file Add the enabled_functions file that is used to show all the functions that have been enabled for tracing as well as their ref counts. This helps seeing if any function has been registered and what functions are being traced. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 15:29:47 -04:00
Steven Rostedt	ed926f9b35	ftrace: Use counters to enable functions to trace Every function has its own record that stores the instruction pointer and flags for the function to be traced. There are only two flags: enabled and free. The enabled flag states that tracing for the function has been enabled (actively traced), and the free flag states that the record no longer points to a function and can be used by new functions (loaded modules). These flags are now moved to the MSB of the flags (actually just the top 32bits). The rest of the bits (30 bits) are now used as a ref counter. Every time a tracer registers functions to trace, those functions will have their counters incremented. When tracing is enabled, to determine if a function should be traced, the counter is examined, and if it is non-zero it is set to trace. When a ftrace_ops is registered to trace functions, its hashes are examined. If the ftrace_ops filter_hash count is zero, then all functions are set to be traced, otherwise only the functions in the hash are to be traced. The exception to this is if a function is also in the ftrace_ops notrace_hash. Then that function's counter is not incremented for this ftrace_ops. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 15:29:47 -04:00
Steven Rostedt	33dc9b1267	ftrace: Separate hash allocation and assignment When filtering, allocate a hash to insert the function records. After the filtering is complete, assign it to the ftrace_ops structure. This allows the ftrace_ops structure to have a much smaller array of hash buckets instead of wasting a lot of memory. A read only empty_hash is created to be the minimum size that any ftrace_ops can point to. When a new hash is created, it has the following steps: o Allocate a default hash. o Walk the function records assigning the filtered records to the hash o Allocate a new hash with the appropriate size buckets o Move the entries from the default hash to the new hash. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 15:29:46 -04:00
Steven Rostedt	f45948e898	ftrace: Create a global_ops to hold the filter and notrace hashes Combine the filter and notrace hashes to be accessed by a single entity, the global_ops. The global_ops is a ftrace_ops structure that is passed to different functions that can read or modify the filtering of the function tracer. The ftrace_ops structure was modified to hold a filter and notrace hashes so that later patches may allow each ftrace_ops to have its own set of rules to what functions may be filtered. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 15:29:45 -04:00
Steven Rostedt	1cf41dd799	ftrace: Use hash instead for FTRACE_FL_FILTER When multiple users are allowed to have their own set of functions to trace, having the FTRACE_FL_FILTER flag will not be enough to handle the accounting of those users. Each user will need their own set of functions. Replace the FTRACE_FL_FILTER with a filter_hash instead. This is temporary until the rest of the function filtering accounting gets in. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 15:29:44 -04:00
Steven Rostedt	b448c4e3ae	ftrace: Replace FTRACE_FL_NOTRACE flag with a hash of ignored functions To prepare for the accounting system that will allow multiple users of the function tracer, having the FTRACE_FL_NOTRACE as a flag in the dyn_trace record does not make sense. All ftrace_ops will soon have a hash of functions they should trace and not trace. Making a global hash of functions not to trace makes the transition easier. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-18 15:29:44 -04:00
Jonathan Cameron	edf76f8307	irq: Export functions to allow modular irq drivers Export handle_simple_irq, irq_modify_status, irq_alloc_descs, irq_free_descs and generic_handle_irq to allow their usage in modules. First user is IIO, which wants to be built modular, but needs to be able to create irq chips, allocate and configure interrupt descriptors and handle demultiplexing interrupts. [ tglx: Moved the uninlining of generic_handle_irq to a separate patch ] Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk> Link: http://lkml.kernel.org/r/%3C1305711544-505-1-git-send-email-jic23%40cam.ac.uk%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-18 14:59:08 +02:00
Thomas Gleixner	fe12bc2c99	genirq: Uninline and sanity check generic_handle_irq() generic_handle_irq() is missing a NULL pointer check for the result of irq_to_desc. This was not a big problem, but we want to expose it to drivers, so we had better have sanity checks in place. Add a return value as well, which indicates that the irq number was valid and the handler was invoked. Based on the pure code move from Jonathan Cameron. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jonathan Cameron <jic23@cam.ac.uk>	2011-05-18 14:59:08 +02:00
Thomas Gleixner	fe05143484	genirq: Remove pointless ifdefs kernel/irq/ is only built when CONFIG_GENERIC_HARDIRQS=y. So making code inside of kernel/irq/ conditional on CONFIG_GENERIC_HARDIRQS is pointless. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-18 14:59:07 +02:00
Rafael J. Wysocki	91e7c75ba9	PM: Allow drivers to allocate memory from .prepare() callbacks safely If device drivers allocate substantial amounts of memory (above 1 MB) in their hibernate .freeze() callbacks (or in their legacy suspend callbacks during hibernation), the subsequent creation of hibernate image may fail due to the lack of memory. This is the case, because the drivers' .freeze() callbacks are executed after the hibernate memory preallocation has been carried out and the preallocated amount of memory may be too small to cover the new driver allocations. Unfortunately, the drivers' .prepare() callbacks also are executed after the hibernate memory preallocation has completed, so they are not suitable for allocating additional memory either. Thus the only way a driver can safely allocate memory during hibernation is to use a hibernate/suspend notifier. However, the notifiers are called before the freezing of user space and the drivers wanting to use them for allocating additional memory may not know how much memory needs to be allocated at that point. To let device drivers overcome this difficulty rework the hibernation sequence so that the memory preallocation is carried out after the drivers' .prepare() callbacks have been executed, so that the .prepare() callbacks can be used for allocating additional memory to be used by the drivers' .freeze() callbacks. Update documentation to match the new behavior of the code. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-05-17 23:26:00 +02:00
Rafael J. Wysocki	c650da23d5	PM: Remove CONFIG_PM_VERBOSE Now that we have CONFIG_DYNAMIC_DEBUG there is no need for yet another flag causing dev_dbg() and pr_debug() statements in the core PM code to produce output. Moreover, CONFIG_PM_VERBOSE causes so much output to be generated that it's not really useful and almost no one sets it. References: https://bugzilla.kernel.org/show_bug.cgi?id=23182 Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-05-17 23:25:10 +02:00
Rafael J. Wysocki	290c748725	Merge branch 'power-domains' into for-linus * power-domains: PM: Fix build issue in clock_ops.c for CONFIG_PM_RUNTIME unset PM: Revert "driver core: platform_bus: allow runtime override of dev_pm_ops" OMAP1 / PM: Use generic clock manipulation routines for runtime PM PM / Runtime: Generic clock manipulation rountines for runtime PM (v6) PM / Runtime: Add subsystem data field to struct dev_pm_info OMAP2+ / PM: move runtime PM implementation to use device power domains PM / Platform: Use generic runtime PM callbacks directly shmobile: Use power domains for platform runtime PM PM: Export platform bus type's default PM callbacks PM: Make power domain callbacks take precedence over subsystem ones	2011-05-17 23:23:46 +02:00
Rafael J. Wysocki	2d2a9163bd	Merge branch 'syscore' into for-linus * syscore: PM: Remove sysdev suspend, resume and shutdown operations PM / PowerPC: Use struct syscore_ops instead of sysdevs for PM PM / UNICORE32: Use struct syscore_ops instead of sysdevs for PM PM / AVR32: Use struct syscore_ops instead of sysdevs for PM PM / Blackfin: Use struct syscore_ops instead of sysdevs for PM ARM / Samsung: Use struct syscore_ops for "core" power management ARM / PXA: Use struct syscore_ops for "core" power management ARM / SA1100: Use struct syscore_ops for "core" power management ARM / Integrator: Use struct syscore_ops for core PM ARM / OMAP: Use struct syscore_ops for "core" power management ARM: Use struct syscore_ops instead of sysdevs for PM in common code	2011-05-17 23:23:40 +02:00
Rafael J. Wysocki	1c1be3a949	Revert "PM / Hibernate: Reduce autotuned default image size" This reverts commit `bea3864fb6` (PM / Hibernate: Reduce autotuned default image size), because users are now able to resolve the issue this commit was supposed to address in a different way (i.e. by using the new /sys/power/reserved_size interface). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-05-17 23:19:19 +02:00
Rafael J. Wysocki	ddeb648708	PM / Hibernate: Add sysfs knob to control size of memory for drivers Martin reports that on his system hibernation occasionally fails due to the lack of memory, because the radeon driver apparently allocates too much of it during the device freeze stage. It turns out that the amount of memory allocated by radeon during hibernation (and presumably during system suspend too) depends on the utilization of the GPU (e.g. hibernating while there are two KDE 4 sessions with compositing enabled causes radeon to allocate more memory than for one KDE 4 session). In principle it should be possible to use image_size to make the memory preallocation mechanism free enough memory for the radeon driver, but in practice it is not easy to guess the right value because of the way the preallocation code uses image_size. For this reason, it seems reasonable to allow users to control the amount of memory reserved for driver allocations made after the hibernate preallocation, which currently is constant and amounts to 1 MB. Introduce a new sysfs file, /sys/power/reserved_size, whose value will be used as the amount of memory to reserve for the post-preallocation reservations made by device drivers, in bytes. For backwards compatibility, set its default (and initial) value to the currently used number (1 MB). References: https://bugzilla.kernel.org/show_bug.cgi?id=34102 Reported-and-tested-by: Martin Steigerwald <Martin@Lichtvoll.de> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-05-17 23:19:19 +02:00
Kay Sievers	13d53f8775	kmod: always provide usermodehelper_disable() We need to prevent kernel-forked processes during system poweroff. Such processes try to access the filesystem whose disks we are trying to shut down at the same time. This causes delays and exceptions in the storage drivers. A follow-up patch will add these calls and needs usermodehelper_disable() also on systems without suspend support. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-05-17 23:19:18 +02:00
Rafael J. Wysocki	a144c6a6c9	PM: Print a warning if firmware is requested when tasks are frozen Some drivers erroneously use request_firmware() from their ->resume() (or ->thaw(), or ->restore()) callbacks, which is not going to work unless the firmware has been built in. This causes system resume to stall until the firmware-loading timeout expires, which makes users think that the resume has failed and reboot their machines unnecessarily. For this reason, make _request_firmware() print a warning and return immediately with error code if it has been called when tasks are frozen and it's impossible to start any new usermode helpers. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Reviewed-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>	2011-05-17 23:19:17 +02:00
Mike Frysinger	ee940d8dcc	Freezer: Use SMP barriers The freezer processes are dealing with multiple threads running simultaneously, and on a UP system, the memory reads/writes do not need barriers to keep things in sync. These are only needed on SMP systems, so use SMP barriers instead. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-05-17 23:19:17 +02:00
MyungJoo Ham	3c43193608	PM / Suspend: Do not ignore error codes returned by suspend_enter() The current implementation of suspend-to-RAM returns 0 if there is an error from suspend_enter(), because suspend_devices_and_enter() ignores the return value from suspend_enter(). This patch addresses this issue: it properly keeps the error return from suspend_enter() and lets suspend_devices_and_enter() relay it. Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-05-17 23:19:16 +02:00
Linus Torvalds	a085963a27	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tick: Clear broadcast active bit when switching to oneshot rtc: mc13xxx: Don't call rtc_device_register while holding lock rtc: rp5c01: Initialize drvdata before registering device rtc: pcap: Initialize drvdata before registering device rtc: msm6242: Initialize drvdata before registering device rtc: max8998: Initialize drvdata before registering device rtc: max8925: Initialize drvdata before registering device rtc: m41t80: Initialize clientdata before registering device rtc: ds1286: Initialize drvdata before registering device rtc: ep93xx: Initialize drvdata before registering device rtc: davinci: Initialize drvdata before registering device rtc: mxc: Initialize drvdata before registering device clocksource: Install completely before selecting	2011-05-17 08:02:04 -07:00
Thomas Gleixner	07f4beb0b5	tick: Clear broadcast active bit when switching to oneshot The first cpu which switches from periodic to oneshot mode switches also the broadcast device into oneshot mode. The broadcast device serves as a backup for per cpu timers which stop in deeper C-states. To avoid starvation of the cpus which might be in idle and depend on broadcast mode it marks the other cpus as broadcast active and sets the broadcast expiry value of those cpus to the next tick. The oneshot mode broadcast bit for the other cpus is sticky and gets only cleared when those cpus exit idle. If a cpu was not idle while the bit got set in consequence the bit prevents that the broadcast device is armed on behalf of that cpu when it enters idle for the first time after it switched to oneshot mode. In most cases that goes unnoticed as one of the other cpus has usually a timer pending which keeps the broadcast device armed with a short timeout. Now if the only cpu which has a short timer active has the bit set then the broadcast device will not be armed on behalf of that cpu and will fire way after the expected timer expiry. In the case of Christian's bug report it took ~145 seconds which is about half of the wrap around time of HPET (the limit for that device) due to the fact that all other cpus had no timers armed which expired before the 145 seconds timeframe. The solution is simply to clear the broadcast active bit unconditionally when a cpu switches to oneshot mode after the first cpu switched the broadcast device over. It's not idle at that point otherwise it would not be executing that code. [ I fundamentally hate that broadcast crap. Why the heck thought some folks that when going into deep idle it's a brilliant concept to switch off the last device which brings the cpu back from that state? ] Thanks to Christian for providing all the valuable debug information! Reported-and-tested-by: Christian Hoffmann <email@christianhoffmann.info> Cc: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1105161105170.3078%40ionos%3E Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-16 23:35:41 +02:00
Stephan Baerwolf	db670dac49	sched: Fix and optimise calculation of the weight-inverse If the inverse loadweight should be zero, function "calc_delta_mine" calculates the inverse of "lw->weight" (in 32bit integer ops). This calculation is actually a little bit impure (because it is inverting something around "lw-weight"+1), especially when "lw->weight" becomes smaller. The correct inverse would be 1/lw->weight multiplied by "WMULT_CONST" for fixcomma-scaling it into integers. (So WMULT_CONST/lw->weight ...) The old, impure algorithm took two divisions for inverting lw->weight, the new, more exact one only takes one and an additional unlikely-if. Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-0pz0wnyalr4tk4ln11xwumdx@git.kernel.org [ This could explain some aritmetical issues for small shares but nothing concrete has been reported yet so we are not confident enough to queue this up in sched/urgent and for -stable backport. But if anyone finds this commit and sees it to fix some badness then we can certainly change our mind! ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-16 11:01:18 +02:00
Yong Zhang	db44fc017d	sched: Avoid going ahead if ->cpus_allowed is not changed If cpumask_equal(&p->cpus_allowed, new_mask) is true, seems there is no reason to prevent set_cpus_allowed_ptr() return directly. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110509140705.GA2219@zhy Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-16 11:01:18 +02:00
Mike Galbraith	61eadef6a9	sched, rt: Update rq clock when unthrottling of an otherwise idle CPU If an RT task is awakened while its rt_rq is throttled, the time between wakeup/enqueue and unthrottle/selection may be accounted as rt_time if the CPU is idle. Set rq->skip_clock_update negative upon throttle release to tell put_prev_task() that we need a clock update. Reported-by: Thomas Giesel <skoe@directbox.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1304059010.7472.1.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-16 11:01:17 +02:00
Cheng Xu	ec514c487c	sched: Fix rt_rq runtime leakage bug This patch is to fix the real-time scheduler bug reported at: https://lkml.org/lkml/2011/4/26/13 That is, when running multiple real-time threads on every logical CPUs and then turning off one CPU, the kernel will bug at function __disable_runtime(). Function __disable_runtime() bugs and reports leakage of rt_rq runtime. The root cause is __disable_runtime() assumes it iterates through all the existing rt_rq's while walking rq->leaf_rt_rq_list, which actually contains only runnable rt_rq's. This problem also applies to __enable_runtime() and print_rt_stats(). The patch is based on above analysis, appears to fix the problem, but is only lightly tested. Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Cheng Xu <chengxu@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4DCE1F12.6040609@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-16 11:00:54 +02:00
Serge E. Hallyn	47a150edc2	Cache user_ns in struct cred If !CONFIG_USERNS, have current_user_ns() defined to (&init_user_ns). Get rid of _current_user_ns. This requires nsown_capable() to be defined in capability.c rather than as static inline in capability.h, so do that. Request_key needs init_user_ns defined at current_user_ns if !CONFIG_USERNS, so forward-declare that in cred.h if !CONFIG_USERNS at current_user_ns() define. Compile-tested with and without CONFIG_USERNS. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> [ This makes a huge performance difference for acl_permission_check(), up to 30%. And that is one of the hottest kernel functions for loads that are pathname-lookup heavy. ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-13 11:45:33 -07:00
Tejun Heo	19e274630c	job control: reorganize wait_task_stopped() wait_task_stopped() tested task_stopped_code() without acquiring siglock and, if stop condition existed, called wait_task_stopped() and directly returned the result. This patch moves the initial task_stopped_code() testing into wait_task_stopped() and make wait_consider_task() fall through to wait_task_continue() on 0 return. This is for the following two reasons. * Because the initial task_stopped_code() test is done without acquiring siglock, it may race against SIGCONT generation. The stopped condition might have been replaced by continued state by the time wait_task_stopped() acquired siglock. This may lead to unexpected failure of WNOHANG waits. This reorganization addresses this single race case but there are other cases - TASK_RUNNING -> TASK_STOPPED transition and EXIT_* transitions. * Scheduled ptrace updates require changes to the initial test which would fit better inside wait_task_stopped(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-05-13 18:56:02 +02:00
Chris Metcalf	be84cb4383	compat: fixes to allow working with tile arch The existing <asm-generic/unistd.h> mechanism doesn't really provide enough to create the 64-bit "compat" ABI properly in a generic way, since the compat ABI is a mix of things where you can re-use the 64-bit versions of syscalls and things where you need a compat wrapper. To provide this in the most direct way possible, I added two new macros to go along with the existing __SYSCALL and __SC_3264 macros: __SC_COMP and __SC_COMP_3264. These macros take an additional argument, typically a "compat_sys_xxx" function, which is passed to __SYSCALL if you define __SYSCALL_COMPAT when including the header, resulting in a pointer to the compat function being placed in the generated syscall table. The change also adds some missing definitions to <linux/compat.h> so that it actually has declarations for all the compat syscalls, since the "[nr] = ##call" approach requires proper C declarations for all the functions included in the syscall table. Finally, compat.c defines compat_sys_sigpending() and compat_sys_sigprocmask() even if the underlying architecture doesn't request it, which tries to pull in undefined compat_old_sigset_t defines. We need to guard those compat syscall definitions with appropriate __ARCH_WANT_SYS_xxx ifdefs. Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>	2011-05-12 15:51:36 -04:00
Samir Bellabes	3e51e3edfd	sched: Remove unused parameters from sched_fork() and wake_up_new_task() sched_fork() and wake_up_new_task() are defined with a parameter 'unsigned long clone_flags', which is unused. This patch removes the parameters. Signed-off-by: Samir Bellabes <sam@synack.fr> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1305130685-1047-1-git-send-email-sam@synack.fr Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-12 09:36:37 +02:00
Ingo Molnar	9cb5baba5e	Merge commit 'v2.6.39-rc7' into sched/core	2011-05-12 09:36:18 +02:00
Rafael J. Wysocki	2e711c04db	PM: Remove sysdev suspend, resume and shutdown operations Since suspend, resume and shutdown operations in struct sysdev_class and struct sysdev_driver are not used any more, remove them. Also drop sysdev_suspend(), sysdev_resume() and sysdev_shutdown() used for executing those operations and modify all of their users accordingly. This reduces kernel code size quite a bit and reduces its complexity. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-05-11 21:37:15 +02:00
Rafael J. Wysocki	36cb7035ea	PM / Hibernate: Fix ioctl SNAPSHOT_S2RAM The SNAPSHOT_S2RAM ioctl used for implementing the feature allowing one to suspend to RAM after creating a hibernation image is currently broken, because it doesn't clear the "ready" flag in the struct snapshot_data object handled by it. As a result, the SNAPSHOT_UNFREEZE doesn't work correctly after SNAPSHOT_S2RAM has returned and the user space hibernate task cannot thaw the other processes as appropriate. Make SNAPSHOT_S2RAM clear data->ready to fix this problem. Tested-by: Alexandre Felipe Muller de Souza <alexandrefm@mandriva.com.br> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: stable@kernel.org	2011-05-11 21:10:58 +02:00
Rafael J. Wysocki	9744997a8a	PM / Hibernate: Make snapshot_release() restore GFP mask If the process using the hibernate user space interface closes /dev/snapshot after creating a hibernation image without thawing tasks, snapshot_release() should call pm_restore_gfp_mask() to restore the GFP mask used before the creation of the image. Make that happen. Tested-by: Alexandre Felipe Muller de Souza <alexandrefm@mandriva.com.br> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: stable@kernel.org	2011-05-11 21:10:43 +02:00
Rafael J. Wysocki	87186475a4	PM: Fix warning in pm_restrict_gfp_mask() during SNAPSHOT_S2RAM ioctl A warning is printed by pm_restrict_gfp_mask() while the SNAPSHOT_S2RAM ioctl is being executed after creating a hibernation image, because pm_restrict_gfp_mask() has been called once already before the image creation and suspend_devices_and_enter() calls it once again. This happens after commit `452aa6999e` (mm/pm: force GFP_NOIO during suspend/hibernation and resume). To avoid this issue, move pm_restrict_gfp_mask() and pm_restore_gfp_mask() from suspend_devices_and_enter() to its caller in kernel/power/suspend.c. Reported-by: Alexandre Felipe Muller de Souza <alexandrefm@mandriva.com.br> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: stable@kernel.org	2011-05-11 21:10:14 +02:00
Eric W. Biederman	34482e89a5	ns proc: Add support for the uts namespace Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2011-05-10 14:35:35 -07:00
Eric W. Biederman	0663c6f8fa	ns: Introduce the setns syscall With the networking stack today there is demand to handle multiple network stacks at a time. Not in the context of containers but in the context of people doing interesting things with routing. There is also demand in the context of containers to have an efficient way to execute some code in the container itself. If nothing else it is very useful as a debugging technique. Both problems can be solved by starting some form of login daemon in the namespaces people want access to, or you can play games by ptracing a process and getting the traced process to do things you want it to do. However it turns out that a login daemon or a ptrace puppet controller are more code, they are more prone to failure, and generally they are less efficient than simply changing the namespace of a process to a specified one. Pieces of this puzzle can also be solved by instead of coming up with a general purpose system call coming up with targeted system calls perhaps socketat that solve a subset of the larger problem. Overall that appears to be more work for less reward. int setns(int fd, int nstype); The fd argument is a file descriptor referring to a proc file of the namespace you want to switch the process to. In the setns system call the nstype is 0 or specifies a clone flag of the namespace you intend to change to prevent changing a namespace unintentionally. v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com> v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com> v4: Moved wiring up of the system call to another patch v5: Cleaned up the system call arguments - Changed the order. - Modified nstype to take the standard clone flags. v6: Added missing error handling as pointed out by Matt Helsley <matthltc@us.ibm.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2011-05-10 14:32:56 -07:00
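A user-space caller of the `int setns(int fd, int nstype)` interface described above might look roughly like the following. This is a sketch under assumptions: `join_namespace_sketch()` is a hypothetical helper, the raw `syscall(SYS_setns, ...)` form is used so the example does not depend on a C library wrapper, and the namespace file path is whatever proc file the caller supplies.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a namespace proc file and ask the kernel to
 * move the calling process into that namespace.
 * nstype is 0 (accept any namespace type) or a clone flag naming the
 * expected type, which guards against joining the wrong kind by mistake.
 * Returns 0 on success, -1 on failure with errno set.
 */
int join_namespace_sketch(const char *ns_path, int nstype)
{
    assert(ns_path != NULL);

    int fd = open(ns_path, O_RDONLY);
    if (fd < 0)
        return -1;

    long rc = syscall(SYS_setns, fd, nstype);

    /* Preserve the errno from setns across close(). */
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;

    return rc == 0 ? 0 : -1;
}
```

A caller would typically pass a path such as `/proc/<pid>/ns/uts`; the call is expected to fail without sufficient privileges, so the return value should always be checked.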
Ingo Molnar	932fed4e2e	Merge commit 'v2.6.39-rc7' into perf/core Merge reason: pull in the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-10 17:05:45 +02:00
Tejun Heo	40ae717d1e	ptrace: fix signal->wait_chldexit usage in task_clear_group_stop_trapping() GROUP_STOP_TRAPPING waiting mechanism piggybacks on signal->wait_chldexit which is primarily used to implement waiting for wait(2) and friends. When do_wait() waits on signal->wait_chldexit, it uses a custom wake up callback, child_wait_callback(), which expects the child task which is waking up the parent to be passed in as @key to filter out spurious wakeups. task_clear_group_stop_trapping() used __wake_up_sync() which uses NULL @key causing the following oops if the parent was doing do_wait(). BUG: unable to handle kernel NULL pointer dereference at 00000000000002d8 IP: [<ffffffff810499f9>] child_wait_callback+0x29/0x80 PGD 1d899067 PUD 1e418067 PMD 0 Oops: 0000 [#1] PREEMPT SMP last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/local_cpus CPU 2 Modules linked in: Pid: 4498, comm: test-continued Not tainted 2.6.39-rc6-work+ #32 Bochs Bochs RIP: 0010:[<ffffffff810499f9>] [<ffffffff810499f9>] child_wait_callback+0x29/0x80 RSP: 0000:ffff88001b889bf8 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff88001fab3af8 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88001d91df20 RBP: ffff88001b889c08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 R13: ffff88001fb70550 R14: 0000000000000000 R15: 0000000000000001 FS: 00007f26ccae4700(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000000002d8 CR3: 000000001b8ac000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process test-continued (pid: 4498, threadinfo ffff88001b888000, task ffff88001fb88000) Stack: ffff88001b889c18 ffff88001fb70538 ffff88001b889c58 ffffffff810312f9 0000000000000001 0000000200000001 ffff88001b889c58 ffff88001fb70518 0000000000000002 0000000000000082 
0000000000000001 0000000000000000 Call Trace: [<ffffffff810312f9>] __wake_up_common+0x59/0x90 [<ffffffff81035263>] __wake_up_sync_key+0x53/0x80 [<ffffffff810352a0>] __wake_up_sync+0x10/0x20 [<ffffffff8105a984>] task_clear_jobctl_trapping+0x44/0x50 [<ffffffff8105bcbc>] ptrace_stop+0x7c/0x290 [<ffffffff8105c20a>] do_signal_stop+0x28a/0x2d0 [<ffffffff8105d27f>] get_signal_to_deliver+0x14f/0x5a0 [<ffffffff81002175>] do_signal+0x75/0x7b0 [<ffffffff8100292d>] do_notify_resume+0x5d/0x70 [<ffffffff8182e36a>] retint_signal+0x46/0x8c Code: 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 8b 47 d8 83 f8 03 74 3a 85 c0 49 89 c8 75 23 89 c0 48 8b 5f e0 4c 8d 0c 40 31 c0 <4b> 39 9c c8 d8 02 00 00 74 1d 48 83 c4 08 5b c9 c3 66 0f 1f 44 Fix it by using __wake_up_sync_key() and passing in the child as @key. I still think it's a mistake to piggyback on wait_chldexit for this. Given the relative low frequency of ptrace use, we would be much better off leaving already complex wait_chldexit alone and using bit waitqueue. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com>	2011-05-09 14:19:54 +02:00
Oleg Nesterov	2e4f7c7769	signal: sys_sigprocmask() needs retarget_shared_pending() sys_sigprocmask() changes current->blocked by hand. Convert this code to use set_current_blocked(). Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2011-05-09 13:48:56 +02:00
Lai Jiangshan	fa4bbc4ca5	perf,rcu: convert call_rcu(swevent_hlist_release_rcu) to kfree_rcu() The rcu callback swevent_hlist_release_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(swevent_hlist_release_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:51:09 -07:00
Lai Jiangshan	cb796ff338	perf,rcu: convert call_rcu(free_ctx) to kfree_rcu() The rcu callback free_ctx() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(free_ctx). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:51:08 -07:00
Lai Jiangshan	025cea99db	cgroup,rcu: convert call_rcu(__free_css_id_cb) to kfree_rcu() The rcu callback __free_css_id_cb() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(__free_css_id_cb). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:50:47 -07:00
Lai Jiangshan	f2da1c40dc	cgroup,rcu: convert call_rcu(free_cgroup_rcu) to kfree_rcu() The rcu callback free_cgroup_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(free_cgroup_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:50:46 -07:00
Lai Jiangshan	30088ad815	cgroup,rcu: convert call_rcu(free_css_set_rcu) to kfree_rcu() The rcu callback free_css_set_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(free_css_set_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:50:45 -07:00
Paul E. McKenney	1217ed1ba5	rcu: permit rcu_read_unlock() to be called while holding runqueue locks Avoid calling into the scheduler while holding core RCU locks. This allows rcu_read_unlock() to be called while holding the runqueue locks, but only as long as there was no chance of the RCU read-side critical section having been preempted. (Otherwise, if RCU priority boosting is enabled, rcu_read_unlock() might call into the scheduler in order to unboost itself, which might allow self-deadlock on the runqueue locks within the scheduler.) Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-05-07 22:50:45 -07:00
Linus Torvalds	8b061610da	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tools: Makefile: Use gcc to determine ARCH perf events, x86: Fix Intel Nehalem and Westmere last level cache event definitions hw_breakpoints, powerpc: Fix CONFIG_HAVE_HW_BREAKPOINT off-case in ptrace_set_debugreg() sh, hw_breakpoints: Fix racy access to ptrace breakpoints arm, hw_breakpoints: Fix racy access to ptrace breakpoints powerpc, hw_breakpoints: Fix racy access to ptrace breakpoints x86, hw_breakpoints: Fix racy access to ptrace breakpoints ptrace: Prepare to fix racy accesses on task breakpoints	2011-05-07 13:17:37 -07:00
Kay Sievers	b50fa7c807	reboot: disable usermodehelper to prevent fs access In case CONFIG_UEVENT_HELPER_PATH is not set to "", which it should be on every system, the kernel forks processes during shutdown, which try to access the rootfs, even when the binary does not exist. It causes exceptions and long delays in the disk driver, which gets read requests at the time it tries to shut down the disk. This patch disables all kernel-forked processes during reboot to allow a clean poweroff. Cc: Tejun Heo <htejun@gmail.com> Tested-By: Anton Guda <atu@dmeti.dp.ua> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-05-06 17:52:32 -07:00
Arjan van de Ven	a3a4a5acd3	Regression: partial revert "tracing: Remove lock_depth from event entry" This partially reverts commit `e6e1e25935`. That commit changed the structure layout of the trace structure, which in turn broke PowerTOP (1.9x generation) quite badly. I appreciate not wanting to expose the variable in question, and PowerTOP was not using it, so I've replaced the variable with just a padding field - that way if in the future a new field is needed it can just use this padding field. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-06 13:20:59 -07:00
Hillf Danton	7142d17e8f	sched: Shorten the construction of the span cpu mask of sched domain For a given node, when constructing the cpumask for its sched_domain to span, if there is no best node available after searching, further efforts could be saved, based on small change in the return value of find_next_best_node(). Signed-off-by: Hillf Danton <dhillf@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Yong Zhang <yong.zhang0@gmail.com> Link: http://lkml.kernel.org/r/BANLkTi%3DqPWxRAa6%2BdT3ohEP6Z%3D0v%2Be4EXA@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-06 09:13:05 +02:00
Rakib Mullick	4934a4d3d3	sched: Wrap the 'cfs_rq->nr_spread_over' field with CONFIG_SCHED_DEBUG cfs_rq->nr_spread_over is only used when CONFIG_SCHED_DEBUG is set. So wrap it with CONFIG_SCHED_DEBUG. Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1304528026.15681.3.camel@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-06 09:04:19 +02:00
Gleb Natapov	29ce831000	rcu: provide rcu_virt_note_context_switch() function. Provide rcu_virt_note_context_switch() for virtualization use to note quiescent state during guest entry. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-05-05 23:16:59 -07:00
Paul E. McKenney	bad6e1393c	rcu: get rid of signed overflow in check_cpu_stall() Signed integer overflow is undefined by the C standard, so move calculations to unsigned. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-05-05 23:16:59 -07:00
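The general technique behind moving such calculations to unsigned can be sketched as follows. This is not the code from check_cpu_stall(); `deadline_passed_sketch()` is a hypothetical helper showing how a wrapping tick counter can be compared against a deadline without ever performing signed arithmetic that could overflow.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: has the wrapping counter `now` reached `deadline`?
 * Unsigned subtraction wraps modulo 2^N by definition, so the difference
 * is well defined even when the counter has rolled over. A difference in
 * the lower half of the range means "now" is at or past the deadline.
 */
int deadline_passed_sketch(unsigned long now, unsigned long deadline)
{
    return (now - deadline) <= ULONG_MAX / 2;
}

/* Usage example, including a deadline set just before the counter wrapped. */
void deadline_selfcheck_sketch(void)
{
    assert(deadline_passed_sketch(10, 10));
    assert(deadline_passed_sketch(5, ULONG_MAX - 5));
    assert(!deadline_passed_sketch(ULONG_MAX - 5, 5));
}
```

The comparison is only meaningful when the two values are less than half the counter range apart, which is the usual assumption for tick counters.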
Eric Dumazet	b554d7de8d	rcu: optimize rcutiny rcu_sched_qs() currently calls local_irq_save()/local_irq_restore() up to three times. Remove irq masking from rcu_qsctr_help() / invoke_rcu_kthread() and do it once in rcu_sched_qs() / rcu_bh_qs() This generates smaller code as well. text data bss dec hex filename 2314 156 24 2494 9be kernel/rcutiny.old.o 2250 156 24 2430 97e kernel/rcutiny.new.o Fix an outdated comment for rcu_qsctr_help() Move invoke_rcu_kthread() definition before its use. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:59 -07:00
Paul E. McKenney	2655d57ef3	rcu: prevent call_rcu() from diving into rcu core if irqs disabled This commit marks a first step towards making call_rcu() have real-time behavior. If irqs are disabled, don't dive into the RCU core. Later on, this new early exit will wake up the per-CPU kthread, which first must be modified to handle the cases involving callback storms. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:59 -07:00
Paul E. McKenney	baa1ae0c9f	rcu: further lower priority in rcu_yield() Although rcu_yield() dropped from real-time to normal priority, there is always the possibility that the competing tasks have been niced. So nice to 19 in rcu_yield() to help ensure that other tasks have a better chance of running. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:59 -07:00
Lai Jiangshan	9ab1544eb4	rcu: introduce kfree_rcu() Many rcu callbacks functions just call kfree() on the base structure. These functions are trivial, but their size adds up, and furthermore when they are used in a kernel module, that module must invoke the high-latency rcu_barrier() function at module-unload time. The kfree_rcu() function introduced by this commit addresses this issue. Rather than encoding a function address in the embedded rcu_head structure, kfree_rcu() instead encodes the offset of the rcu_head structure within the base structure. Because the functions are not allowed in the low-order 4096 bytes of kernel virtual memory, offsets up to 4095 bytes can be accommodated. If the offset is larger than 4095 bytes, a compile-time error will be generated in __kfree_rcu(). If this error is triggered, you can either fall back to use of call_rcu() or rearrange the structure to position the rcu_head structure into the first 4096 bytes. Note that the allowable offset might decrease in the future, for example, to allow something like kmem_cache_free_rcu(). The new kfree_rcu() function can replace code as follows: call_rcu(&p->rcu, simple_kfree_callback); where "simple_kfree_callback()" might be defined as follows: void simple_kfree_callback(struct rcu_head *p) { struct foo *q = container_of(p, struct foo, rcu); kfree(q); } with the following: kfree_rcu(&p->rcu, rcu); Note that the "rcu" is the name of a field in the structure being freed. The reason for using this rather than passing in a pointer to the base structure is that the above approach allows better type checking. This commit is based on earlier work by Lai Jiangshan and Manfred Spraul: Lai's V1 patch: http://lkml.org/lkml/2008/9/18/1 Manfred's patch: http://lkml.org/lkml/2009/1/2/115 Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:59 -07:00
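The offset-encoding idea in the kfree_rcu() message can be illustrated in user space. This is a minimal sketch, not the kernel implementation: every `*_sketch` name is hypothetical, `free()` stands in for `kfree()`, no grace period is modelled, and the offset limit is checked with a run-time `assert()` where the message describes a compile-time error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct rcu_head. */
struct rcu_head_sketch {
    void (*func)(struct rcu_head_sketch *head);
};

/* Example base structure with the head at a non-zero offset. */
struct foo_sketch {
    int payload;
    struct rcu_head_sketch rcu;
};

/* Values below this are treated as offsets, not function addresses. */
#define OFFSET_LIMIT_SKETCH 4096

/* Store the head's offset within its base structure in place of a callback. */
void kfree_rcu_sketch(struct rcu_head_sketch *head, size_t offset)
{
    assert(offset < OFFSET_LIMIT_SKETCH);
    head->func = (void (*)(struct rcu_head_sketch *))(uintptr_t)offset;
}

/*
 * What the callback-invocation side would do: a small "address" is an
 * offset, so step back to the base structure and free it directly.
 * Returns 1 if the offset path was taken, 0 if a real callback ran.
 */
int reclaim_sketch(struct rcu_head_sketch *head)
{
    uintptr_t value = (uintptr_t)head->func;

    if (value < OFFSET_LIMIT_SKETCH) {
        free((char *)head - value);
        return 1;
    }
    head->func(head);
    return 0;
}
```

A caller would queue an object with `kfree_rcu_sketch(&p->rcu, offsetof(struct foo_sketch, rcu))`; the design saves one trivial callback function per structure type at the cost of reserving the low addresses as offsets.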
Paul E. McKenney	6cc68793e3	rcu: fix spelling The "preemptible" spelling is preferable. May as well fix it. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:59 -07:00
Lai Jiangshan	13491a0ee1	rcu: call __rcu_read_unlock() in exit_rcu for tree RCU Using __rcu_read_lock() in place of rcu_read_lock() leaves any debug state as it really should be, namely with the lock still held. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:58 -07:00
Paul E. McKenney	7e8b4c7234	rcu: Converge TINY_RCU expedited and normal boosting This applies a trick from TREE_RCU boosting to TINY_RCU, eliminating code and adding comments. The key point is that it is possible for the booster thread itself to work out whether there is a normal or expedited boost required based solely on local information. There is therefore no need for boost initiation to know or care what type of boosting is required. In addition, when boosting is complete for a given grace period, then by definition there cannot be any more boosting for that grace period. This allows eliminating yet more state and statistics. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:58 -07:00
Paul E. McKenney	203373c81b	rcu: remove useless ->boosted_this_gp field The ->boosted_this_gp field is a holdover from an earlier design that was to carry out multiple boost operations in parallel. It is not required by the current design, which boosts one task at a time. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-05-05 23:16:58 -07:00
Paul E. McKenney	ddeb75814f	rcu: code cleanups in TINY_RCU priority boosting. Extraneous semicolon, bad comment, and fold INIT_LIST_HEAD() into list_del() to get list_del_init(). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:58 -07:00
Paul E. McKenney	f0a07aeaf8	rcu: Switch to this_cpu() primitives This removes a couple of lines from invoke_rcu_cpu_kthread(), improving readability. Reported-by: Christoph Lameter <cl@linux.com> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:57 -07:00
Paul E. McKenney	108aae2233	rcu: Use WARN_ON_ONCE for DEBUG_OBJECTS_RCU_HEAD warnings Avoid additional multiple-warning confusion in memory-corruption scenarios. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:57 -07:00
Paul E. McKenney	561190e3b3	rcu: mark rcutorture boosting callback as being on-stack The CONFIG_DEBUG_OBJECTS_RCU_HEAD facility requires that on-stack RCU callbacks be flagged explicitly to debug-objects using the init_rcu_head_on_stack() and destroy_rcu_head_on_stack() functions. This commit applies those functions to the rcutorture code that tests RCU priority boosting. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:57 -07:00
Mathieu Desnoyers	fc2ecf7ec7	rcu: Enable DEBUG_OBJECTS_RCU_HEAD from !PREEMPT The prohibition of DEBUG_OBJECTS_RCU_HEAD from !PREEMPT was due to the fixup actions. So just produce a warning from !PREEMPT. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:57 -07:00
Paul E. McKenney	5ece5bab3e	rcu: Add forward-progress diagnostic for per-CPU kthreads Increment a per-CPU counter on each pass through rcu_cpu_kthread()'s service loop, and add it to the rcudata trace output. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:57 -07:00
Paul E. McKenney	15ba0ba860	rcu: add grace-period age and more kthread state to tracing This commit adds the age in jiffies of the current grace period along with the duration in jiffies of the longest grace period since boot to the rcu/rcugp debugfs file. It also adds an additional "O" state to kthread tracing to differentiate between the kthread waiting due to having nothing to do on the one hand and waiting due to being on the wrong CPU on the other hand. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-05-05 23:16:56 -07:00
Paul E. McKenney	a9f4793d89	rcu: fix tracing bug thinko on boost-balk attribution The rcu_initiate_boost_trace() function mis-attributed refusals to initiate RCU priority boosting that were in fact due to its not yet being time to boost. This patch fixes the faulty comparison. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-05-05 23:16:56 -07:00
Paul E. McKenney	4a29865689	rcu: make rcutorture version numbers available through debugfs It is not possible to accurately correlate rcutorture output with that of debugfs. This patch therefore adds a debugfs file that prints out the rcutorture version number, permitting easy correlation. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:56 -07:00
Paul E. McKenney	d71df90ead	rcu: add tracing for RCU's kthread run states. Add tracing to help debugging situations when RCU's kthreads are not running but are supposed to be. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:56 -07:00
Paul E. McKenney	0ac3d136b2	rcu: add callback-queue information to rcudata output This commit adds an indication of the state of the callback queue using a string of four characters following the "ql=" integer queue length. The first character is "N" if there are callbacks that have been queued that are not yet ready to be handled by the next grace period, or "." otherwise. The second character is "R" if there are callbacks queued that are ready to be handled by the next grace period, or "." otherwise. The third character is "W" if there are callbacks waiting for the current grace period, or "." otherwise. Finally, the fourth character is "D" if there are callbacks that have been handled by a prior grace period and are waiting to be invoked, or ".". Note that callbacks that are in the process of being invoked are not shown. These callbacks would have been removed from the rcu_data structure's list by rcu_do_batch() prior to being executed. (These callbacks are also not reflected in the "ql=" total, FWIW.) Also, document the new callback-queue trace information. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:56 -07:00
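The four-character queue-state indicator described above can be sketched like this. The helper name and its count parameters are hypothetical; the kernel derives the same information from the rcu_data callback list rather than from explicit counts.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical formatter for the four flags that follow "ql=":
 *   N  callbacks queued but not yet ready for the next grace period
 *   R  callbacks ready to be handled by the next grace period
 *   W  callbacks waiting for the current grace period
 *   D  callbacks already handled by a grace period, waiting to be invoked
 * Each position holds '.' when its segment is empty.
 */
void format_queue_state_sketch(int n_next, int n_ready, int n_wait,
                               int n_done, char out[5])
{
    assert(out != NULL);
    out[0] = n_next > 0 ? 'N' : '.';
    out[1] = n_ready > 0 ? 'R' : '.';
    out[2] = n_wait > 0 ? 'W' : '.';
    out[3] = n_done > 0 ? 'D' : '.';
    out[4] = '\0';
}
```

With callbacks only in the first and third segments, the formatter produces `N.W.`; an empty queue produces `....`.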
Paul E. McKenney	0ea1f2ebeb	rcu: Add boosting to TREE_PREEMPT_RCU tracing Includes total number of tasks boosted, number boosted on behalf of each of normal and expedited grace periods, and statistics on attempts to initiate boosting that failed for various reasons. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:55 -07:00
Paul E. McKenney	67b98dba47	rcu: eliminate unused boosting statistics The n_rcu_torture_boost_allocerror and n_rcu_torture_boost_afferror statistics are not actually incremented anymore, so eliminate them. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:55 -07:00
Paul E. McKenney	3acf4a9a3d	rcu: avoid hammering sched with yet another bound RT kthread The scheduler does not appear to take kindly to having multiple real-time threads bound to a CPU that is going offline. So this commit is a temporary hack-around to avoid that happening. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-05-05 23:16:55 -07:00
Paul E. McKenney	e3995a25fa	rcu: put per-CPU kthread at non-RT priority during CPU hotplug operations If you are doing CPU hotplug operations, it is best not to have CPU-bound realtime tasks running CPU-bound on the outgoing CPU. So this commit makes per-CPU kthreads run at non-realtime priority during that time. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:55 -07:00
Paul E. McKenney	0f962a5e72	rcu: Force per-rcu_node kthreads off of the outgoing CPU The scheduler has had some heartburn in the past when too many real-time kthreads were affinitied to the outgoing CPU. So, this commit lightens the load by forcing the per-rcu_node and the boost kthreads off of the outgoing CPU. Note that RCU's per-CPU kthread remains on the outgoing CPU until the bitter end, as it must in order to preserve correctness. Also avoid disabling hardirqs across calls to set_cpus_allowed_ptr(), given that this function can block. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-05-05 23:16:55 -07:00
Paul E. McKenney	27f4d28057	rcu: priority boosting for TREE_PREEMPT_RCU Add priority boosting for TREE_PREEMPT_RCU, similar to that for TINY_PREEMPT_RCU. This is enabled by the default-off RCU_BOOST kernel parameter. The priority to which to boost preempted RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter (defaulting to real-time priority 1) and the time to wait before boosting the readers who are blocking a given grace period is controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to 500 milliseconds). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:55 -07:00
Paul E. McKenney	a26ac2455f	rcu: move TREE_RCU from softirq to kthread If RCU priority boosting is to be meaningful, callback invocation must be boosted in addition to preempted RCU readers. Otherwise, in presence of CPU real-time threads, the grace period ends, but the callbacks don't get invoked. If the callbacks don't get invoked, the associated memory doesn't get freed, so the system is still subject to OOM. But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit moves the callback invocations to a kthread, which can be boosted easily. Also add comments and properly synchronized all accesses to rcu_cpu_kthread_task, as suggested by Lai Jiangshan. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:54 -07:00
Paul E. McKenney	12f5f524ca	rcu: merge TREE_PREEPT_RCU blocked_tasks[] lists Combine the current TREE_PREEMPT_RCU ->blocked_tasks[] lists in the rcu_node structure into a single ->blkd_tasks list with ->gp_tasks and ->exp_tasks tail pointers. This is in preparation for RCU priority boosting, which will add a third dimension to the combinatorial explosion in the ->blocked_tasks[] case, but simply a third pointer in the new ->blkd_tasks case. Also update documentation to reflect blocked_tasks[] merge Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:54 -07:00
Paul E. McKenney	e59fb3120b	rcu: Decrease memory-barrier usage based on semi-formal proof Commit `d09b62d` fixed grace-period synchronization, but left some smp_mb() invocations in rcu_process_callbacks() that are no longer needed, but sheer paranoia prevented them from being removed. This commit removes them and provides a proof of correctness in their absence. It also adds a memory barrier to rcu_report_qs_rsp() immediately before the update to rsp->completed in order to handle the theoretical possibility that the compiler or CPU might move massive quantities of code into a lock-based critical section. This also proves that the sheer paranoia was not entirely unjustified, at least from a theoretical point of view. In addition, the old dyntick-idle synchronization depended on the fact that grace periods were many milliseconds in duration, so that it could be assumed that no dyntick-idle CPU could reorder a memory reference across an entire grace period. Unfortunately for this design, the addition of expedited grace periods breaks this assumption, which has the unfortunate side-effect of requiring atomic operations in the functions that track dyntick-idle state for RCU. (There is some hope that the algorithms used in user-level RCU might be applied here, but some work is required to handle the NMIs that user-space applications can happily ignore. For the short term, better safe than sorry.) This proof assumes that neither compiler nor CPU will allow a lock acquisition and release to be reordered, as doing so can result in deadlock. The proof is as follows: 1. A given CPU declares a quiescent state under the protection of its leaf rcu_node's lock. 2. If there is more than one level of rcu_node hierarchy, the last CPU to declare a quiescent state will also acquire the ->lock of the next rcu_node up in the hierarchy, but only after releasing the lower level's lock. The acquisition of this lock clearly cannot occur prior to the acquisition of the leaf node's lock. 3. Step 2 repeats until we reach the root rcu_node structure. Please note again that only one lock is held at a time through this process. The acquisition of the root rcu_node's ->lock must occur after the release of that of the leaf rcu_node. 4. At this point, we set the ->completed field in the rcu_state structure in rcu_report_qs_rsp(). However, if the rcu_node hierarchy contains only one rcu_node, then in theory the code preceding the quiescent state could leak into the critical section. We therefore precede the update of ->completed with a memory barrier. All CPUs will therefore agree that any updates preceding any report of a quiescent state will have happened before the update of ->completed. 5. Regardless of whether a new grace period is needed, rcu_start_gp() will propagate the new value of ->completed to all of the leaf rcu_node structures, under the protection of each rcu_node's ->lock. If a new grace period is needed immediately, this propagation will occur in the same critical section that ->completed was set in, but courtesy of the memory barrier in #4 above, is still seen to follow any pre-quiescent-state activity. 6. When a given CPU invokes __rcu_process_gp_end(), it becomes aware of the end of the old grace period and therefore makes any RCU callbacks that were waiting on that grace period eligible for invocation. If this CPU is the same one that detected the end of the grace period, and if there is but a single rcu_node in the hierarchy, we will still be in the single critical section. In this case, the memory barrier in step #4 guarantees that all callbacks will be seen to execute after each CPU's quiescent state. On the other hand, if this is a different CPU, it will acquire the leaf rcu_node's ->lock, and will again be serialized after each CPU's quiescent state for the old grace period. On the strength of this proof, this commit therefore removes the memory barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp(). The effect is to reduce the number of memory barriers by one and to reduce the frequency of execution from about once per scheduling tick per CPU to once per grace period. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:54 -07:00
Paul E. McKenney	a00e0d714f	rcu: Remove conditional compilation for RCU CPU stall warnings The RCU CPU stall warnings can now be controlled using the rcu_cpu_stall_suppress boot-time parameter or via the same parameter from sysfs. There is therefore no longer any reason to have kernel config parameters for this feature. This commit therefore removes the RCU_CPU_STALL_DETECTOR and RCU_CPU_STALL_DETECTOR_RUNNABLE kernel config parameters. The RCU_CPU_STALL_TIMEOUT parameter remains to allow the timeout to be tuned and the RCU_CPU_STALL_VERBOSE parameter remains to allow task-stall information to be suppressed if desired. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-05 23:16:54 -07:00
Ingo Molnar	4d70230bb4	Merge branch 'master' of ssh://master.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into perf/urgent	2011-05-06 08:11:28 +02:00
Anton Blanchard	228e548e60	net: Add sendmmsg socket system call This patch adds a multiple message send syscall and is the send version of the existing recvmmsg syscall. This is heavily based on the patch by Arnaldo that added recvmmsg. I wrote a microbenchmark to test the performance gains of using this new syscall: http://ozlabs.org/~anton/junkcode/sendmmsg_test.c The test was run on a ppc64 box with a 10 Gbit network card. The benchmark can send both UDP and RAW ethernet packets. 64B UDP batch pkts/sec 1 804570 2 872800 (+ 8 %) 4 916556 (+14 %) 8 939712 (+17 %) 16 952688 (+18 %) 32 956448 (+19 %) 64 964800 (+20 %) 64B raw socket batch pkts/sec 1 1201449 2 1350028 (+12 %) 4 1461416 (+22 %) 8 1513080 (+26 %) 16 1541216 (+28 %) 32 1553440 (+29 %) 64 1557888 (+30 %) We see a 20% improvement in throughput on UDP send and 30% on raw socket send. [ Add sparc syscall entries. -DaveM ] Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-05 11:10:14 -07:00
Andi Kleen	7372b0b122	clockevents: Move C3 stop test outside lock Avoid taking broadcast_lock in the idle path for systems where the timer doesn't stop in C3. [ tglx: Removed the stale label and added comment ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Dave Kleikamp <dkleikamp@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: lenb@kernel.org Cc: paulmck@us.ibm.com Link: http://lkml.kernel.org/r/%3C20110504234806.GF2925%40one.firstfloor.org%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-05 17:32:13 +02:00
john stultz	e05b2efb82	clocksource: Install completely before selecting Christian Hoffmann reported that the command line clocksource override with acpi_pm timer fails: Kernel command line: <SNIP> clocksource=acpi_pm hpet clockevent registered Switching to clocksource hpet Override clocksource acpi_pm is not HRT compatible. Cannot switch while in HRT/NOHZ mode. The watchdog code is what enables CLOCK_SOURCE_VALID_FOR_HRES, but we actually end up selecting the clocksource before we enqueue it into the watchdog list, so that's why we see the warning and fail to switch to acpi_pm timer as requested. That's particularly bad when we want to debug timekeeping related problems in early boot. Put the selection call last. Reported-by: Christian Hoffmann <email@christianhoffmann.info> Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: stable@kernel.org # 32... Link: http://lkml.kernel.org/r/%3C1304558210.2943.24.camel%40work-vm%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-05 15:23:26 +02:00
Ingo Molnar	98bb318864	Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/urgent	2011-05-04 20:33:42 +02:00
Vladimir Davydov	931aeeda0d	sched: Remove unused 'this_best_prio arg' from balance_tasks() It's passed across multiple functions but is never really used, so remove it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1304447467-29200-1-git-send-email-vdavydov@parallels.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-04 09:07:21 +02:00
Ingo Molnar	e7e7ee2eab	perf events: Clean up definitions and initializers, update copyrights Fix a few inconsistent style bits that were added over the past few months. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-yv4hwf9yhnzoada8pcpb3a97@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-04 08:49:24 +02:00
Thomas Gleixner	179eb03268	alarmtimer: Drop device refcount after rtc_open() class_find_device() takes a refcount on the rtc device. rtc_open() takes another one, so we can drop it after the rtc_open() call. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org>	2011-05-04 08:18:34 +02:00
Thomas Gleixner	ce788f930b	alarmtimer: Check return value of class_find_device() alarmtimer_late_init() uses class_find_device() to find an alarm capable rtc device. The match callback stores a pointer to the name in the char pointer handed in from the call site. alarmtimer_late_init() checks the char pointer for NULL, but the pointer is on the stack and not initialized to NULL before the call. So it can have random content when the match function did not identify a device, which leads to random access in the following rtc_open() call where the pointer is dereferenced. Instead of relying on the char pointer, check the return value of class_find_device. If a device is found then the name pointer is valid as well. Reported-by: Ingo Molnar <mingo@elte.hu> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-04 08:18:17 +02:00
Borislav Petkov	48dbb6dc86	hw breakpoints: Move to kernel/events/ As part of the events subsystem unification, relocate hw_breakpoint.c into its new destination. Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2011-05-03 15:26:43 +02:00
Borislav Petkov	fae85b7c8b	perf: Start the restructuring mv kernel/perf_event.c -> kernel/events/core.c. From there, all further sensible splitting can happen. The idea is that due to perf_event.c becoming pretty sizable and with the advent of the marriage with ftrace, splitting functionality into its logical parts should help speeding up the unification and to manage the complexity of the subsystem. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2011-05-03 12:59:43 +02:00
Thomas Gleixner	99ee5315da	timerfd: Allow timers to be cancelled when clock was set Some applications must be aware of clock realtime being set backward. A simple example is a clock applet which arms a timer for the next minute display. If clock realtime is set backward then the applet displays a stale time for the amount of time which the clock was set backwards. Due to that applications poll the time because we don't have an interface. Extend the timerfd interface by adding a flag which puts the timer onto a different internal realtime clock. All timers on this clock are expired whenever the clock was set. The timerfd core records the monotonic offset when the timer is created. When the timer is armed, then the current offset is compared to the previous recorded offset. When it has changed, then timerfd_settime returns -ECANCELED. When a timer is read the offset is compared and if it changed -ECANCELED returned to user space. Periodic timers are not rearmed in the cancelation case. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Chris Friesen <chris.friesen@genband.com> Tested-by: Kay Sievers <kay.sievers@vrfy.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davide Libenzi <davidel@xmailserver.org> Reviewed-by: Alexander Shishkin <virtuoso@slind.org> Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1104271359580.3323%40ionos%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-02 21:39:15 +02:00
Thomas Gleixner	b12a03ce48	hrtimers: Prepare for cancel on clock was set timers Make clock_was_set() unconditional and rename hres_timers_resume to hrtimers_resume. This is a preparatory patch for hrtimers which are cancelled when clock realtime was set. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-02 21:37:58 +02:00
Mike Frysinger	942c3c5c32	hrtimer: Make lookup table const Signed-off-by: Mike Frysinger <vapier@gentoo.org> Link: http://lkml.kernel.org/r/%3C1304364267-14489-1-git-send-email-vapier%40gentoo.org%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-02 21:37:57 +02:00
Thomas Gleixner	3687a2c0d8	Merge branch 'linus' into timers/core Reason: Pick up the hrtimer_clock_to_base_table fix from mainline Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-02 21:37:08 +02:00
John Stultz	472647dcd7	timers: Fix alarmtimer build issues when CONFIG_RTC_CLASS=n Ingo pointed out that the alarmtimers won't build if CONFIG_RTC_CLASS=n. This patch adds proper ifdefs to the alarmtimer code to disable the rtc usage if it is not built in. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-02 21:36:57 +02:00
Geert Uytterhoeven	94b2c363dc	genirq: Fix typo CONFIG_GENIRC_IRQ_SHOW_LEVEL commit `ab7798ffcf` ("genirq: Expand generic show_interrupts()") added the Kconfig option GENERIC_IRQ_SHOW_LEVEL to accommodate PowerPC, but this doesn't actually enable the functionality due to a typo in the #ifdef check. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Linux/PPC Development <linuxppc-dev@lists.ozlabs.org> Link: http://lkml.kernel.org/r/%3Calpine.DEB.2.00.1104302251370.19068%40ayla.of.borg%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-05-02 21:16:37 +02:00
Thomas Gleixner	c42321c76b	genirq: Make generic irq chip depend on CONFIG_GENERIC_IRQ_CHIP Only compile it in when there are users. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org	2011-05-02 18:16:22 +02:00
Ingo Molnar	ac0a3260f3	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2011-05-01 19:11:42 +02:00
Ingo Molnar	809435ff4f	Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2011-05-01 19:09:39 +02:00
Linus Torvalds	3fd9952df4	Merge branch 'fixes-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'fixes-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix deadlock in worker_maybe_bind_and_lock() workqueue: Document debugging tricks Fix up trivial spelling conflict in kernel/workqueue.c	2011-04-30 09:15:40 -07:00
Steven Rostedt	b9df92d2a9	ftrace: Consolidate the function match routines for normal and mods The code used for matching functions is almost identical between normal selecting of functions and using the :mod: feature of set_ftrace_notrace. Consolidate the two users into one function. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-29 22:53:14 -04:00
Steven Rostedt	491d0dcfb9	ftrace: Consolidate updating of ftrace_trace_function There are three locations that perform almost identical functions in order to update the ftrace_trace_function (the ftrace function variable that gets called by mcount). Consolidate these into a single function called update_ftrace_function(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-29 22:53:11 -04:00
Steven Rostedt	996e87be7f	ftrace: Move record update for normal and modules into a separate function The updating of a function record is moved to a single function. This will allow us to add specific changes in one location for both modules and kernel functions. Later patches will determine if the function record itself needs to be updated (which enables the mcount caller), or just the ftrace_ops needs the update. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-29 22:53:08 -04:00
Steven Rostedt	d2c8c3eafb	ftrace: Remove FTRACE_FL_CONVERTED flag Since we disable all function tracer processing if we detect that a modification of a instruction had failed, we do not need to track that the record has failed. No more ftrace processing is allowed, and the FTRACE_FL_CONVERTED flag is pointless. The FTRACE_FL_CONVERTED flag was used to denote records that were successfully converted from mcount calls into nops. But if a single record fails, all of ftrace is disabled. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-29 22:53:04 -04:00
Steven Rostedt	45a4a2372b	ftrace: Remove FTRACE_FL_FAILED flag Since we disable all function tracer processing if we detect that a modification of a instruction had failed, we do not need to track that the record has failed. No more ftrace processing is allowed, and the FTRACE_FL_FAILED flag is pointless. Removing this flag simplifies some of the code, but some ftrace_disabled checks needed to be added or move around a little. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-29 22:53:01 -04:00
Steven Rostedt	3499e46114	ftrace: Remove failures file The failures file in the debugfs tracing directory would list the functions that failed to convert when the old dead ftrace daemon tried to update code but failed. Since this code is now dead along with the daemon the failures file is useless. Remove it. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-29 22:52:58 -04:00
Steven Rostedt	8ab2b7efd3	ftrace: Remove unnecessary disabling of irqs The disabling of interrupts around ftrace_update_code() was used to protect against the evil ftrace daemon from years past. But that daemon has long been killed. It is safe to keep interrupts enabled while updating the initial mcount into nops. The ftrace_mutex is also held which keeps other users at bay. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-29 22:52:55 -04:00
Steven Rostedt	0778d9ad33	ftrace: Make FTRACE_WARN_ON() work in if condition Let FTRACE_WARN_ON() be used as a stand alone statement or inside a conditional: if (FTRACE_WARN_ON(x)) Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-29 22:52:52 -04:00
Steven Rostedt	058e297d34	ftrace: Only update the function code on write to filter files If function tracing is enabled, a read of the filter files will cause the call to stop_machine to update the function trace sites. It should only call stop_machine on write. Cc: stable@kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-29 22:42:59 -04:00
Rafael J. Wysocki	85eb8c8d0b	PM / Runtime: Generic clock manipulation rountines for runtime PM (v6) Many different platforms and subsystems may want to disable device clocks during suspend and enable them during resume which is going to be done in a very similar way in all those cases. For this reason, provide generic routines for the manipulation of device clocks during suspend and resume. Convert the ARM shmobile platform to using the new routines. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-04-30 00:25:44 +02:00
Linus Torvalds	40a963502c	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf, x86, nmi: Move LVT un-masking into irq handlers perf events, x86: Work around the Nehalem AAJ80 erratum perf, x86: Fix BTS condition ftrace: Build without frame pointers on Microblaze	2011-04-29 15:08:53 -07:00
Linus Torvalds	fcc4dc7151	Merge branch 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer: Initialize CLOCK_ID to HRTIMER_BASE table statically rtc: max8925: Call dev_set_drvdata before rtc_device_register	2011-04-29 15:08:31 -07:00
Tejun Heo	5035b20fa5	workqueue: fix deadlock in worker_maybe_bind_and_lock() If a rescuer and stop_machine() bringing down a CPU race with each other, they may deadlock on non-preemptive kernel. The CPU won't accept a new task, so the rescuer can't migrate to the target CPU, while stop_machine() can't proceed because the rescuer is holding one of the CPU retrying migration. GCWQ_DISASSOCIATED is never cleared and worker_maybe_bind_and_lock() retries indefinitely. This problem can be reproduced semi reliably while the system is entering suspend. http://thread.gmane.org/gmane.linux.kernel/1122051 A lot of kudos to Thilo-Alexander for reporting this tricky issue and painstaking testing. stable: This affects all kernels with cmwq, so all kernels since and including v2.6.36 need this fix. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Thilo-Alexander Ginkel <thilo@ginkel.com> Tested-by: Thilo-Alexander Ginkel <thilo@ginkel.com> Cc: stable@kernel.org	2011-04-29 18:08:37 +02:00
Thomas Gleixner	ce31332d3c	hrtimer: Initialize CLOCK_ID to HRTIMER_BASE table statically Sedat and Bruno reported RCU stalls which turned out to be caused by the following; sched_init() calls init_rt_bandwidth() which calls hrtimer_init() _BEFORE_ hrtimers_init() is called. While not entirely correct this worked because hrtimer_init() only accessed statically initialized data (hrtimer_bases.clock_base[CLOCK_MONOTONIC]) Commit `e06383db9` (hrtimers: extend hrtimer base code to handle more then 2 clockids) added an indirection to the hrtimer_bases.clock_base lookup to avoid gap handling in the hot path. The table which is used for the translation from CLOCK_ID to HRTIMER_BASE index is initialized at runtime in hrtimers_init(). So the early call of the scheduler code translates CLOCK_MONOTONIC to HRTIMER_BASE_REALTIME. Thus the rt_bandwith timer ends up on CLOCK_REALTIME. If the timer is armed and the wall clock time is set (e.g. ntpdate in the early boot process - which also gives the problem deterministic behaviour i.e. magic recovery after N hours), then the timer ends up with an expiry time far into the future. That breaks the RT throttler mechanism as rt runtime is accumulated and never cleared, so the rt throttler detects a false cpu hog condition and blocks all RT tasks until the timer finally expires. That in turn stalls the RCU thread of TINYRCU which leads to a huge amount of RCU callbacks piling up. Make the translation table statically initialized, so we are back to the status of <= 2.6.39. Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reported-by: Bruno Prémont <bonbons@linux-vserver.org> Cc: John stultz <johnstul@us.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1104282353140.3005%40ionos%3E Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-04-29 10:57:11 +02:00
John Stultz	7068b7a162	timers: Remove delayed irqwork from alarmtimers implementation Thomas asked about the delayed irq work in the alarmtimers code, and I realized that it was a legacy from when the alarmtimer base lock was a mutex (due to concerns that we'd be interacting with the RTC device, which is protected by mutexes). Since the alarmtimer base is now protected by a spinlock, we can simply execute alarmtimer functions directly from the hrtimer callback. Should any future alarmtimer functions sleep, they can simply manage scheduling any delayed work themselves. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-04-28 13:39:18 -07:00
John Stultz	180bf812ce	timers: Improve alarmtimer comments and minor fixes This patch addresses a number of minor comment improvements and other minor issues from Thomas' review of the alarmtimers code. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-04-28 13:39:17 -07:00
Hillf Danton	1409f141ac	kernel/watchdog.c: disable nmi perf event in the error path of enabling watchdog In corner cases where softlockup watchdog is not setup successfully, the relevant nmi perf event for hardlockup watchdog could be disabled, then the status of the underlying hardware remains unchanged. Also, if the kthread doesn't start then the hrtimer won't run and the hardlockup detector will falsely fire. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-04-28 11:28:21 -07:00
Oleg Nesterov	b013c39924	signal: cleanup sys_sigprocmask() Cleanup. Remove the unneeded goto's, we can simply read blocked.sig[0] unconditionally and then copy-to-user it if oset != NULL. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>	2011-04-28 13:01:40 +02:00
Oleg Nesterov	702a5073fd	signal: rename signandsets() to sigandnsets() As Tejun and Linus pointed out, "nand" is the wrong name for "x & ~y", it should be "andn". Rename signandsets() as suggested. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-04-28 13:01:39 +02:00
Oleg Nesterov	b182801ab3	signal: do_sigtimedwait() needs retarget_shared_pending() do_sigtimedwait() changes current->blocked and thus it needs set_current_blocked()->retarget_shared_pending(). We could use set_current_blocked() directly. It is fine to change ->real_blocked from all-zeroes to ->blocked and vice versa lockless, but this is not immediately clear, looks racy, and needs a huge comment to explain why this is correct. To keep the things simple this patch adds the new static helper, __set_task_blocked() which should be called with ->siglock held. This way we can change both ->real_blocked and ->blocked atomically under ->siglock as the current code does. This is more understandable. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>	2011-04-28 13:01:39 +02:00
Oleg Nesterov	943df1485a	signal: introduce do_sigtimedwait() to factor out compat/native code Factor out the common code in sys_rt_sigtimedwait/compat_sys_rt_sigtimedwait to the new helper, do_sigtimedwait(). Add the comment to document the extra tick we add to timespec_to_jiffies(ts), thanks to Linus who explained this to me. Perhaps it would be better to move compat_sys_rt_sigtimedwait() into signal.c under CONFIG_COMPAT, then we can make do_sigtimedwait() static. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>	2011-04-28 13:01:38 +02:00
Oleg Nesterov	fe0faa005d	signal: sys_rt_sigtimedwait: simplify the timeout logic No functional changes, cleanup compat_sys_rt_sigtimedwait() and sys_rt_sigtimedwait(). Calculate the timeout before we take ->siglock, this simplifies and lessens the code. Use timespec_valid() to check the timespec. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>	2011-04-28 13:01:38 +02:00
Oleg Nesterov	bb7efee2ca	signal: cleanup sys_rt_sigprocmask() sys_rt_sigprocmask() looks unnecessarily complicated, simplify it. We can just read current->blocked lockless unconditionally before anything else and then copy-to-user it if needed. At worst we copy 4 words on mips. We could copy-to-user the old mask first and simplify the code even more, but the patch tries to keep the current behaviour: we change current->blocked even if copy_to_user(oset) fails. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-04-28 13:01:38 +02:00
Oleg Nesterov	e6fa16ab9c	signal: sigprocmask() should do retarget_shared_pending() In short, almost every change of current->blocked is wrong, or at least can lead to unexpected results. For example: two threads T1 and T2, T1 sleeps in sigtimedwait/pause/etc. kill(tgid, SIG) can pick T2 for TIF_SIGPENDING. If T2 calls sigprocmask() and blocks SIG before it notices the pending signal, nobody else can handle this pending shared signal. I am not sure this is a bug, but at least this looks strange imho. T1 should not sleep forever, there is a signal which should wake it up. This patch moves the code which actually changes ->blocked into the new helper, set_current_blocked() and changes this code to call retarget_shared_pending() as exit_signals() does. We should only care about the signals we just blocked, we use "newset & ~current->blocked" as a mask. We do not check !sigisemptyset(newblocked), retarget_shared_pending() is cheap unless mask & shared_pending. Note: for this particular case we could simply change sigprocmask() to return -EINTR if signal_pending(), but then we should change other callers and, more importantly, if we need this fix then set_current_blocked() will have more callers and some of them can't restart. See the next patch as a random example. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-04-28 13:01:37 +02:00
Oleg Nesterov	73ef4aeb61	signal: sigprocmask: narrow the scope of ->siglock No functional changes, preparation to simplify the review of the next change. 1. We can read current->blocked lockless, nobody else can ever change this mask. 2. Calculate the resulting sigset_t outside of ->siglock into the temporary variable, then take ->siglock and change ->blocked. Also, kill the stale comment about BKL. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-04-28 13:01:36 +02:00
Oleg Nesterov	fec9993db0	signal: retarget_shared_pending: optimize while_each_thread() loop retarget_shared_pending() blindly does recalc_sigpending_and_wake() for every sub-thread, this is suboptimal. We can check t->blocked and stop looping once every bit in shared_pending has the new target. Note: we do not take task_is_stopped_or_traced(t) into account, we are not trying to speed up the signal delivery or to avoid the unnecessary (but harmless) signal_wake_up(0) in this unlikely case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-04-28 13:01:35 +02:00
Oleg Nesterov	f646e227b8	signal: retarget_shared_pending: consider shared/unblocked signals only exit_signals() checks signal_pending() before retarget_shared_pending() but this is suboptimal. We can avoid the while_each_thread() loop in case when there are no shared signals visible to us. Add the "shared_pending.signal & ~blocked" check. We don't use tsk->blocked directly but pass ~blocked as an argument, this is needed for the next patch. Note: we can optimize this more. while_each_thread(t) can take t->blocked into account and stop after every pending signal has the new target, see the next patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-04-28 13:01:35 +02:00
Oleg Nesterov	0edceb7bcd	signal: introduce retarget_shared_pending() No functional changes. Move the notify-other-threads code from exit_signals() to the new helper, retarget_shared_pending(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com> Acked-by: Tejun Heo <tj@kernel.org>	2011-04-28 13:01:35 +02:00
Jeff Mahoney	e11feaa119	watchdog, hung_task_timeout: Add Kconfig configurable default This patch allows the default value for sysctl_hung_task_timeout_secs to be set at build time. The feature carries virtually no overhead, so it makes sense to keep it enabled. On heavily loaded systems, though, it can end up triggering stack traces when there is no bug other than the system being underprovisioned. We use this patch to keep the hung task facility available but disabled at boot-time. The default of 120 seconds is preserved. As a note, commit `e162b39a` may have accidentally reverted commit `fb822db4`, which raised the default from 120 seconds to 480 seconds. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Acked-by: Mandeep Singh Baines <msb@google.com> Link: http://lkml.kernel.org/r/4DB8600C.8080000@suse.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-28 09:13:17 +02:00
Tony Jones	f562988350	audit: acquire creds selectively to reduce atomic op overhead Commit `c69e8d9c01` ("CRED: Use RCU to access another task's creds and to release a task's own creds") added calls to get_task_cred and put_cred in audit_filter_rules. Profiling with a large number of audit rules active on the exit chain shows that we are spending up to 48% in this routine for syscall intensive tests, most of which is in the atomic ops. 1. The code should be accessing tsk->cred rather than tsk->real_cred. 2. Since tsk is current (or tsk is being created by copy_process) access to tsk->cred without rcu read lock is possible. At the request of the audit maintainer, a new flag has been added to audit_filter_rules in order to make this explicit and guide future code. Signed-off-by: Tony Jones <tonyj@suse.de> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-04-27 15:11:03 +02:00
Ingo Molnar	32673822e4	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core Conflicts: include/linux/perf_event.h Merge reason: pick up the latest jump-label enhancements, they are cooked ready. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-27 10:40:21 +02:00
Ingo Molnar	6c8a721327	Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/urgent	2011-04-27 10:31:29 +02:00
John Stultz	9a7adcf5c6	timers: Posix interface for alarm-timers This patch exposes alarm-timers to userland via the posix clock and timers interface, using two new clockids: CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM. Both clockids behave identically to CLOCK_REALTIME and CLOCK_BOOTTIME, respectively, but timers set against the _ALARM suffixed clockids will wake the system if it is suspended. Some background can be found here: https://lwn.net/Articles/429925/ The concept for Alarm-timers was inspired by the Android Alarm driver (by Arve Hjønnevåg) found in the Android kernel tree. See: http://android.git.kernel.org/?p=kernel/common.git;a=blob;f=drivers/rtc/alarm.c;h=1250edfbdf3302f5e4ea6194847c6ef4bb7beb1c;hb=android-2.6.36 While the in-kernel interface is pretty similar between alarm-timers and Android alarm driver, the user-space interface for the Android alarm driver is via ioctls to a new char device. As mentioned above, I've instead chosen to export this functionality via the posix interface, as it seemed a little simpler and avoids creating duplicate interfaces to things like CLOCK_REALTIME and CLOCK_MONOTONIC under alternate names (ie:ANDROID_ALARM_RTC and ANDROID_ALARM_SYSTEMTIME). The semantics of the Android alarm driver are different from what this posix interface provides. For instance, threads other than the thread waiting on the Android alarm driver are able to modify the alarm being waited on. Also this interface does not allow the same wakelock semantics that the Android driver provides (ie: kernel takes a wakelock on RTC alarm-interrupt, and holds it through process wakeup, and while the process runs, until the process either closes the char device or calls back in to wait on a new alarm). One potential way to implement similar semantics may be via the timerfd infrastructure, but this needs more research. There may also need to be some sort of sysfs system level policy hooks that allow alarm timers to be disabled to keep them from firing at inappropriate times (ie: laptop in a well insulated bag, mid-flight). CC: Arve Hjønnevåg <arve@android.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Alessandro Zummo <a.zummo@towertech.it> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-04-26 14:01:46 -07:00
John Stultz	ff3ead96d1	timers: Introduce in-kernel alarm-timer interface This provides the in kernel interface and infrastructure for alarm-timers. Alarm-timers are a hybrid style timer, similar to hrtimers, but when the system is suspended, the RTC device is set to fire and wake the system for when the soonest alarm-timer expires. The concept for Alarm-timers was inspired by the Android Alarm driver (by Arve Hjønnevåg) found in the Android kernel tree. See: http://android.git.kernel.org/?p=kernel/common.git;a=blob;f=drivers/rtc/alarm.c;h=1250edfbdf3302f5e4ea6194847c6ef4bb7beb1c;hb=android-2.6.36 This in-kernel interface should be fairly compatible with the Android alarm driver in-kernel interface, but has the advantage of utilizing the new RTC timerqueue code instead of doing direct RTC manipulation. CC: Arve Hjønnevåg <arve@android.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Alessandro Zummo <a.zummo@towertech.it> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-04-26 14:01:44 -07:00
John Stultz	304529b1b6	time: Add timekeeping_inject_sleeptime Some platforms cannot implement read_persistent_clock, as their RTC devices are only accessible when interrupts are enabled. This keeps them from being used by the timekeeping code on resume to measure the time in suspend. The RTC layer tries to work around this, by calling do_settimeofday on resume after irqs are reenabled to set the time properly. However, this only corrects CLOCK_REALTIME, and does not properly adjust the sleep time value. This causes btime in /proc/stat to be incorrect as well as making the new CLOCK_BOOTTIME inaccurate. This patch resolves the issue by introducing a new timekeeping hook to allow the RTC layer to inject the sleep time on resume. The code also checks to make sure that read_persistent_clock is nonfunctional before setting the sleep time, so that should the RTC's HCTOSYS option be configured in on a system that does support read_persistent_clock we will not increase the total_sleep_time twice. CC: Arve Hjønnevåg <arve@android.com> CC: Thomas Gleixner <tglx@linutronix.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-04-26 14:01:41 -07:00
Hillf Danton	1437f5bca3	sched: Remove noop in alloc_rt_sched_group() The rq variable, though computed for each possible cpu, has nothing to do in the function, so it can be removed. This also eliminates a build warning. Signed-off-by: Hillf Danton <dhillf@gmail.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/BANLkTin-FfQfqW5ym1iuEmrk8s777Y1LAg@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-26 13:34:08 +02:00
Jiri Kosina	07f9479a40	Merge branch 'master' into for-next Fast-forwarded to current state of Linus' tree as there are patches to be applied for files that didn't exist on the old branch.	2011-04-26 10:22:59 +02:00
Roland Vossen	7816c45bf1	modules: Enabled dynamic debugging for staging modules Driver modules from the staging directory are marked 'tainted' by module.c. Subsequently, tainted modules are denied dynamic debugging. This is unwanted behavior, since staging modules should be able to use the dynamic debugging mechanism. Please merge this also into the staging-linus branch. Signed-off-by: Roland Vossen <rvossen@broadcom.com> Acked-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-04-25 16:45:22 -07:00
Jonathan Cameron	e7e09cd667	params.c: Use new strtobool function to process boolean inputs No functional changes. Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-04-25 16:04:52 -07:00
Frederic Weisbecker	bf26c01849	ptrace: Prepare to fix racy accesses on task breakpoints When a task is traced and is in a stopped state, the tracer may execute a ptrace request to examine the tracee state and get its task struct. Right after, the tracee can be killed and thus its breakpoints released. This can happen concurrently when the tracer is in the middle of reading or modifying these breakpoints, leading to dereferencing a freed pointer. Hence, to prepare the fix, create a generic breakpoint reference holding API. When a reference on the breakpoints of a task is held, the breakpoints won't be released until the last reference is dropped. After that, no more ptrace request on the task's breakpoints can be serviced for the tracer. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Will Deacon <will.deacon@arm.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: v2.6.33.. <stable@kernel.org> Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com	2011-04-25 17:28:24 +02:00
Jonathan Corbet	625f2a378e	sched: Get rid of lock_depth Neil Brown pointed out that lock_depth somehow escaped the BKL removal work. Let's get rid of it now. Note that the perf scripting utilities still have a bunch of code for dealing with common_lock_depth in tracepoints; I have left that in place in case anybody wants to use that code with older kernels. Suggested-by: Neil Brown <neilb@suse.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110422111910.456c0e84@bike.lwn.net Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-24 13:18:38 +02:00
Linus Torvalds	686c4cbb10	Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: Add missing syscore_suspend() and syscore_resume() calls PM: Fix error code paths executed after failing syscore_suspend()	2011-04-23 22:35:16 -07:00
Thomas Gleixner	cfefd21e69	genirq: Add chip suspend and resume callbacks These callbacks are only called in the syscore suspend/resume code on interrupt chips which have been registered via the generic irq chip mechanism. Calling those callbacks per irq would be rather icky, but with the generic irq chip mechanism we can call this per registered chip. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org	2011-04-23 15:56:24 +02:00
Thomas Gleixner	7d82806247	genirq: Implement a generic interrupt chip Implement a generic interrupt chip, which is configurable and is able to handle the most common irq chip implementations. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org Tested-by: H Hartley Sweeten <hsweeten@visionengravers.com> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Kevin Hilman <khilman@ti.com>	2011-04-23 15:56:24 +02:00
Paul Mundt	7f1b1244e1	genirq: Support per-IRQ thread disabling. This adds support for disabling threading on a per-IRQ basis via the IRQ status instead of the IRQ flow, which is necessary for interrupts that don't follow the natural IRQ flow channels, such as those that are virtually created. The new APIs added are simply: irq_set_thread() irq_set_nothread() which follow the rest of the IRQ status routines. Chained handlers also have IRQ_NOTHREAD set on them automatically, making the lack of threading explicit rather than implicit. Subsequently, the nothread flag can be viewed through the standard genirq debugging facilities. [ tglx: Fixed cleanup fallout ] Signed-off-by: Paul Mundt <lethal@linux-sh.org> Link: http://lkml.kernel.org/r/%3C20110406210135.GF18426%40linux-sh.org%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-04-23 15:56:24 +02:00
Steven Rostedt	e0944ee63f	lockdep: Remove cmpxchg to update nr_chain_hlocks For some reason nr_chain_hlocks is updated with cmpxchg, but this is performed inside of the lockdep global "grab_lock()", which also makes simple modification of this variable atomic. Remove the cmpxchg logic for updating nr_chain_hlocks and simplify the code. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110421014300.727863282@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-22 11:06:59 +02:00
Steven Rostedt	282b5c2f6f	lockdep: Print a nicer description for simple irq lock inversions Lockdep output can be pretty cryptic, having nicer output can save a lot of head scratching. When a simple irq inversion scenario is detected by lockdep (lock A taken in interrupt context but also in thread context without disabling interrupts) we now get the following (hopefully more informative) output: other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(lockA); <Interrupt> lock(lockA); * DEADLOCK * Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110421014300.436140880@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-22 11:06:59 +02:00
Steven Rostedt	6be8c3935b	lockdep: Replace "Bad BFS generated tree" message with something less cryptic The message of "Bad BFS generated tree" is a bit confusing. Replace it with a more sane error message. Thanks to Peter Zijlstra for helping me come up with a better message. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110421014300.135521252@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-22 11:06:59 +02:00
Steven Rostedt	dad3d7435e	lockdep: Print a nicer description for irq inversion bugs Irq inversion and irq dependency bugs are only subtly different. The difference lies in where the interrupt occurred. For irq dependency: irq_disable lock(A) lock(B) unlock(B) unlock(A) irq_enable lock(B) unlock(B) <interrupt> lock(A) The interrupt comes in after it has been established that lock A can be held when taking an irq unsafe lock. Lockdep detects the problem when taking lock A in interrupt context. With the irq_inversion the irq happens before it is established and lockdep detects the problem with the taking of lock B: <interrupt> lock(A) irq_disable lock(A) lock(B) unlock(B) unlock(A) irq_enable lock(B) unlock(B) Since the problem with the locking logic for both of these issues is in actuality the same, they both should report the same scenario. This patch implements that and prints this: other info that might help us debug this: Chain exists of: &rq->lock --> lockA --> lockC Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(lockC); local_irq_disable(); lock(&rq->lock); lock(lockA); <Interrupt> lock(&rq->lock); * DEADLOCK * Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110421014259.910720381@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-22 11:06:58 +02:00
Steven Rostedt	48702ecf30	lockdep: Print a nicer description for simple deadlocks Lockdep output can be pretty cryptic, having nicer output can save a lot of head scratching. When a simple deadlock scenario is detected by lockdep (lock A -> lock A) we now get the following new output: other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(lock)->rlock); lock(&(lock)->rlock); * DEADLOCK * Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110421014259.643930104@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-22 11:06:58 +02:00
Steven Rostedt	f4185812aa	lockdep: Print a nicer description for normal deadlocks The lockdep output can be pretty cryptic, having nicer output can save a lot of head scratching. When a normal deadlock scenario is detected by lockdep (lock A -> lock B and there exists a place where lock B -> lock A) we now get the following new output: other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(lockB); lock(lockA); lock(lockB); lock(lockA); * DEADLOCK * On cases where there's a deeper chain, it shows the partial chain that can cause the issue: Chain exists of: lockC --> lockA --> lockB Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(lockB); lock(lockA); lock(lockB); lock(lockC); * DEADLOCK * Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110421014259.380621789@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-22 11:06:57 +02:00
Steven Rostedt	3003eba313	lockdep: Print a nicer description for irq lock inversions Locking order inversion due to interrupts is a subtle problem. When an irq lock inversion is discovered by lockdep, it currently reports something like: [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ] ... and then prints out the locks that are involved, as back traces. Judging by lkml feedback developers were routinely confused by what a HARDIRQ->safe to unsafe issue is all about, and sometimes even blew it off as a bug in lockdep. It is not obvious when lockdep prints this message about a lock that is never taken in interrupt context. After explaining the problems that lockdep is reporting, I decided to add a description of the problem in visual form. Now the following is shown: --- other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(lockA); local_irq_disable(); lock(&rq->lock); lock(lockA); <Interrupt> lock(&rq->lock); * DEADLOCK * --- The above is the case when the unsafe lock is taken while holding a lock taken in irq context. But when a lock is taken that also grabs an unsafe lock, the call chain is shown: --- other info that might help us debug this: Chain exists of: &rq->lock --> lockA --> lockC Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(lockC); local_irq_disable(); lock(&rq->lock); lock(lockA); <Interrupt> lock(&rq->lock); * DEADLOCK * Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110421014259.132728798@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-22 11:06:57 +02:00
Michal Simek	d20ac25282	ftrace: Build without frame pointers on Microblaze Microblaze doesn't need/support FRAME_POINTERS in order to have a working function tracer. The patch removes the Kconfig warning. Warning log: warning: (LOCKDEP && FAULT_INJECTION_STACKTRACE_FILTER && LATENCYTOP && FUNCTION_TRACER && KMEMCHECK) selects FRAME_POINTER which has unmet direct dependencies (DEBUG_KERNEL && (CRIS || M68K || FRV || UML || AVR32 || SUPERH || BLACKFIN || MN10300) || ARCH_WANT_FRAME_POINTERS) Signed-off-by: Michal Simek <monstr@monstr.eu> Link: http://lkml.kernel.org/r/1301908812-8119-2-git-send-email-monstr@monstr.eu CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Ingo Molnar <mingo@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-21 09:06:24 -04:00
Rakib Mullick	d3bf52e998	sched: Remove obsolete comment from scheduler_tick() scheduler_tick() is no longer called by fork code - this got discarded a long time ago by commit `bc947631d1` ("sched: improve efficiency of sched_fork()"). So, remove the comment which still claims otherwise. Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/BANLkTimO4iGP0QpaHO1HHF1QOnVcQpc0cw@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-21 11:41:36 +02:00
Ingo Molnar	42ac9e87fd	Merge commit 'v2.6.39-rc4' into sched/core Merge reason: Pick up upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-21 11:39:28 +02:00
Ludwig Nussel	088ab0b4d8	kernel/ksysfs.c: expose file_caps_enabled in sysfs A kernel booted with no_file_caps allows to install fscaps on a binary but doesn't actually honor the fscaps when running the binary. Userspace currently has no sane way to determine whether installing fscaps actually has any effect. Since parsing /proc/cmdline is fragile this patch exposes the current setting (1 or 0) via /sys/kernel/fscaps Signed-off-by: Ludwig Nussel <ludwig.nussel@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-04-19 16:45:51 -07:00
Rafael J. Wysocki	19234c0819	PM: Add missing syscore_suspend() and syscore_resume() calls Device suspend/resume infrastructure is used not only by the suspend and hibernate code in kernel/power, but also by APM, Xen and the kexec jump feature. However, commit `40dc166cb5` (PM / Core: Introduce struct syscore_ops for core subsystems PM) failed to add syscore_suspend() and syscore_resume() calls to that code, which generally leads to breakage when the features in question are used. To fix this problem, add the missing syscore_suspend() and syscore_resume() calls to arch/x86/kernel/apm_32.c, kernel/kexec.c and drivers/xen/manage.c. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Ian Campbell <ian.campbell@citrix.com>	2011-04-20 00:36:11 +02:00
Linus Torvalds	4ae0ff16ef	Merge branch 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: RTC: rtc-omap: Fix a leak of the IRQ during init failure posix clocks: Replace mutex with reader/writer semaphore	2011-04-19 10:56:46 -07:00
James Morris	d4ab4e6a23	Merge branch 'master'; commit 'v2.6.39-rc3' into next	2011-04-19 21:32:41 +10:00
Peter Zijlstra	057f3fadb3	sched: Fix sched_domain iterations vs. RCU Valdis Kletnieks reported a new RCU debug warning in the scheduler. Since commit `dce840a087` ("sched: Dynamically allocate sched_domain/ sched_group data-structures") the sched_domain trees are protected by RCU instead of RCU-sched. This means that we need to include rcu_read_lock() protection when we iterate them since disabling preemption doesn't suffice anymore. Reported-by: Valdis.Kletnieks@vt.edu Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1302882741.2388.241.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-19 10:56:54 +02:00
Venkatesh Pallipadi	2f36825b17	sched: Next buddy hint on sleep and preempt path When a task in a taskgroup sleeps, pick_next_task starts all the way back at the root and picks the task/taskgroup with the min vruntime across all runnable tasks. But when there are many frequently sleeping tasks across different taskgroups, it makes better sense to stay with same taskgroup for its slice period (or until all tasks in the taskgroup sleep) instead of switching cross taskgroup on each sleep after a short runtime. This helps specifically where a taskgroup corresponds to a process with multiple threads. The change reduces the number of CR3 switches in this case. Example: Two taskgroups with 2 threads each which are running for 2ms and sleeping for 1ms. Looking at sched:sched_switch shows: BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017] cpu-soaker-5004 [003] 3683.391089 cpu-soaker-5016 [003] 3683.393106 cpu-soaker-5005 [003] 3683.395119 cpu-soaker-5017 [003] 3683.397130 cpu-soaker-5004 [003] 3683.399143 cpu-soaker-5016 [003] 3683.401155 cpu-soaker-5005 [003] 3683.403168 cpu-soaker-5017 [003] 3683.405170 AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935] cpu-soaker-21890 [003] 865.895494 cpu-soaker-21935 [003] 865.897506 cpu-soaker-21934 [003] 865.899520 cpu-soaker-21935 [003] 865.901532 cpu-soaker-21934 [003] 865.903543 cpu-soaker-21935 [003] 865.905546 cpu-soaker-21891 [003] 865.907548 cpu-soaker-21890 [003] 865.909560 cpu-soaker-21891 [003] 865.911571 cpu-soaker-21890 [003] 865.913582 cpu-soaker-21891 [003] 865.915594 cpu-soaker-21934 [003] 865.917606 A similar problem exists when there are multiple taskgroups and say a task A preempts currently running task B of taskgroup_1. On schedule, pick_next_task can pick an unrelated task on taskgroup_2. Here it would be better to give some preference to task B on pick_next_task. A simple (may be extreme case) benchmark I tried was tbench with 2 tbench client processes with 2 threads each running on a single CPU. Avg throughput across five 50 sec runs was: BEFORE: 105.84 MB/sec AFTER: 112.42 MB/sec Signed-off-by: Venkatesh Pallipadi <venki@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1302802253-25760-1-git-send-email-venki@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-19 10:08:38 +02:00
Venkatesh Pallipadi	69c80f3e9d	sched: Make set_*_buddy() work on non-task entities Make set_*_buddy() work on non-task sched_entity, to facilitate the use of next_buddy to cache a group entity in cases where one of the tasks within that entity sleeps or gets preempted. set_skip_buddy() was incorrectly comparing the policy of task that is yielding to be not equal to SCHED_IDLE. Yielding should happen even when task yielding is SCHED_IDLE. This change removes the policy check on the yielding task. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1302744070-30079-2-git-send-email-venki@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-19 10:08:37 +02:00
Rafael J. Wysocki	2ca6f62f59	PM: Fix error code paths executed after failing syscore_suspend() If syscore_suspend() fails in suspend_enter(), create_image() or resume_target_kernel(), it is necessary to call sysdev_resume(), because sysdev_suspend() has been called already and succeeded and we are going to abort the transition. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-04-18 23:58:59 +02:00
Linus Torvalds	c78193e9c7	next_pidmap: fix overflow condition next_pidmap() just quietly accepted whatever 'last' pid that was passed in, which is not all that safe when one of the users is /proc. Admittedly the proc code should do some sanity checking on the range (and that will be the next commit), but that doesn't mean that the helper functions should just do that pidmap pointer arithmetic without checking the range of its arguments. So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1" doesn't really matter, the for-loop does check against the end of the pidmap array properly (it's only the actual pointer arithmetic overflow case we need to worry about, and going one bit beyond isn't going to overflow). [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ] Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com> Analyzed-by: Robert Święcki <robert@swiecki.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-04-18 10:35:30 -07:00
Ingo Molnar	6ddafdaab3	Merge branch 'sched/locking' into sched/core Merge reason: the rq locking changes are stable, propagate them into the .40 queue. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-18 14:53:33 +02:00
Richard Cochran	1791f88143	posix clocks: Replace mutex with reader/writer semaphore A dynamic posix clock is protected from asynchronous removal by a mutex. However, using a mutex has the unwanted effect that a long running clock operation in one process will unnecessarily block other processes. For example, one process might call read() to get an external time stamp coming in at one pulse per second. A second process calling clock_gettime would have to wait for almost a whole second. This patch fixes the issue by using a reader/writer semaphore instead of a mutex. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/%3C20110330132421.GA31771%40riccoc20.at.omicron.at%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-04-18 10:39:38 +02:00
Linus Torvalds	d733ed6c34	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: make unplug timer trace event correspond to the schedule() unplug block: let io_schedule() flush the plug inline	2011-04-16 10:33:41 -07:00
Linus Torvalds	fdfc552abe	Merge branches 'core-fixes-for-linus', 'perf-fixes-for-linus', 'sched-fixes-for-linus', 'timer-fixes-for-linus' and 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec() perf: Fix a build error with some GCC versions * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix erroneous all_pinned logic sched: Fix sched-domain avg_load calculation * 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: RTC: rtc-mrst: follow on to the change of rtc_device_register() RTC: add missing "return 0" in new alarm func for rtc-bfin.c RTC: Fix s3c compile error due to missing s3c_rtc_setpie RTC: Fix early irqs caused by calling rtc_set_alarm too early * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, amd: Disable GartTlbWlkErr when BIOS forgets it x86, NUMA: Fix fakenuma boot failure x86/mrst: Fix boot crash caused by incorrect pin to irq mapping x86/ce4100: Add reg property to bridges	2011-04-16 09:45:08 -07:00
Jens Axboe	49cac01e1f	block: make unplug timer trace event correspond to the schedule() unplug It's a pretty close match to what we had before - the timer triggering would mean that nobody unplugged the plug in due time, in the new scheme this matches very closely what the schedule() unplug now is. It's essentially the difference between an explicit unplug (IO unplug) or an implicit unplug (timer unplug, we scheduled with pending IO queued). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-16 13:51:05 +02:00
Jens Axboe	a237c1c5bc	block: let io_schedule() flush the plug inline Linus correctly observes that the most important dispatch cases are now done from kblockd, this isn't ideal for latency reasons. The original reason for switching dispatches out-of-line was to avoid too deep a stack, so by _only_ letting the "accidental" flush directly in schedule() be guarded by offload to kblockd, we should be able to get the best of both worlds. So add a blk_schedule_flush_plug() that offloads to kblockd, and only use that from the schedule() path. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-16 13:27:55 +02:00
Linus Torvalds	5853b4f06f	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: only force kblockd unplugging from the schedule() path block: cleanup the block plug helper functions block, blk-sysfs: Use the variable directly instead of a function call block: move queue run on unplug to kblockd block: kill queue_sync_plugs() block: readd plug trace event block: add callback function for unplug notification block: add comment on why we save and disable interrupts in flush_plug_list() block: fixup block IO unplug trace call block: remove block_unplug_timer() trace point block: splice plug list to local context	2011-04-15 08:01:13 -07:00
Darren Hart	0cd9c6494e	futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup The FLAGS_HAS_TIMEOUT flag was not getting set, causing the restart_block to restart futex_wait() without a timeout after a signal. Commit `b41277dc7a` in 2.6.38 introduced the regression by accidentally removing the FLAGS_HAS_TIMEOUT assignment from futex_wait() during the setup of the restart block. Restore the original behavior. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=32922 Reported-by: Tim Smith <tsmith201104@yahoo.com> Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com> Signed-off-by: Darren Hart <dvhart@linux.intel.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Kacur <jkacur@redhat.com> Cc: stable@kernel.org Link: http://lkml.kernel.org/r/%3Cdaac0eb3af607f72b9a4d3126b2ba8fb5ed3b883.1302820917.git.dvhart%40linux.intel.com%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-04-15 16:34:32 +02:00
Peter Zijlstra	317f394160	sched: Move the second half of ttwu() to the remote cpu Now that we've removed the rq->lock requirement from the first part of ttwu() and can compute placement without holding any rq->lock, ensure we execute the second half of ttwu() on the actual cpu we want the task to run on. This avoids having to take rq->lock and doing the task enqueue remotely, saving lots on cacheline transfers. As measured using: http://oss.oracle.com/~mason/sembench.c $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done $ echo 4096 32000 64 128 > /proc/sys/kernel/sem $ ./sembench -t 2048 -w 1900 -o 0 unpatched: run time 30 seconds 647278 worker burns per second patched: run time 30 seconds 816715 worker burns per second Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152729.515897185@chello.nl	2011-04-14 08:52:41 +02:00
Peter Zijlstra	bd8e7dded8	sched: Remove need_migrate_task() Oleg noticed that need_migrate_task() doesn't need the ->on_cpu check now that ttwu() doesn't do remote enqueues for !->on_rq && ->on_cpu, so remove the helper and replace the single instance with a direct ->on_rq test. Suggested-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110405152729.556674812@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-14 08:52:41 +02:00
Peter Zijlstra	c05fbafba1	sched: Restructure ttwu() some more Factor out helper functions to make the inner workings of try_to_wake_up() more obvious; this also allows for adding remote queues. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152729.475848012@chello.nl	2011-04-14 08:52:40 +02:00
Peter Zijlstra	23f41eeb42	sched: Rename ttwu_post_activation() to ttwu_do_wakeup() The ttwu_post_activation() code does the core wakeup: it sets TASK_RUNNING and performs wakeup-preemption, so give it a more descriptive name. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152729.434609705@chello.nl	2011-04-14 08:52:40 +02:00
Peter Zijlstra	b84cb5df1f	sched: Remove rq argument from ttwu_stat() In order to call ttwu_stat() without holding rq->lock we must remove its rq argument. Since we need to change rq stats, account to the local rq instead of the task rq, this is safe since we have IRQs disabled. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152729.394638826@chello.nl	2011-04-14 08:52:40 +02:00
Peter Zijlstra	e4a52bcb9a	sched: Remove rq->lock from the first half of ttwu() Currently ttwu() does two rq->lock acquisitions, once on the task's old rq, holding it over the p->state fiddling and load-balance pass. Then it drops the old rq->lock to acquire the new rq->lock. By having serialized ttwu(), p->sched_class, p->cpus_allowed with p->pi_lock, we can now drop the whole first rq->lock acquisition. The p->pi_lock serializing concurrent ttwu() calls protects p->state, which we will set to TASK_WAKING to bridge possible p->pi_lock to rq->lock gaps and serialize set_task_cpu() calls against task_rq_lock(). The p->pi_lock serialization of p->sched_class allows us to call scheduling class methods without holding the rq->lock, and the serialization of p->cpus_allowed allows us to do the load-balancing bits without races. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152729.354401150@chello.nl	2011-04-14 08:52:39 +02:00
Peter Zijlstra	8f42ced974	sched: Drop rq->lock from sched_exec() Since we can now call select_task_rq() and set_task_cpu() with only p->pi_lock held, and sched_exec() load-balancing has always been optimistic, drop all rq->lock usage. Oleg also noted that need_migrate_task() will always be true for current, so don't bother calling that at all. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110405152729.314204889@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-14 08:52:39 +02:00
Peter Zijlstra	ab2515c4b9	sched: Drop rq->lock from first part of wake_up_new_task() Since p->pi_lock now protects all things needed to call select_task_rq() avoid the double remote rq->lock acquisition and rely on p->pi_lock. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110405152729.273362517@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-14 08:52:38 +02:00
Peter Zijlstra	0122ec5b02	sched: Add p->pi_lock to task_rq_lock() In order to be able to call set_task_cpu() while either holding p->pi_lock or task_rq(p)->lock we need to hold both locks in order to stabilize task_rq(). This makes task_rq_lock() acquire both locks, and have __task_rq_lock() validate that p->pi_lock is held. This increases the locking overhead for most scheduler syscalls but allows reduction of rq->lock contention for some scheduler hot paths (ttwu). Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110405152729.232781355@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-14 08:52:38 +02:00
Peter Zijlstra	2acca55ed9	sched: Also serialize ttwu_local() with p->pi_lock Since we now serialize ttwu() using p->pi_lock, we also need to serialize ttwu_local() using that, otherwise, once we drop the rq->lock from ttwu() it can race with ttwu_local(). Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152729.192366907@chello.nl	2011-04-14 08:52:37 +02:00
Peter Zijlstra	a8e4f2eaec	sched: Delay task_contributes_to_load() In preparation for having to call task_contributes_to_load() without holding rq->lock, we need to store the result until we do and can update the rq accounting accordingly. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152729.151523907@chello.nl	2011-04-14 08:52:37 +02:00
Peter Zijlstra	3fe1698b7f	sched: Deal with non-atomic min_vruntime reads on 32bits In order to avoid reading partially updated min_vruntime values on 32bit, implement a seqcount-like solution. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110405152729.111378493@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-14 08:52:37 +02:00
Peter Zijlstra	74f8e4b233	sched: Remove rq argument to sched_class::task_waking() In preparation of calling this without rq->lock held, remove the dependency on the rq argument. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110405152729.071474242@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-14 08:52:36 +02:00
Peter Zijlstra	7608dec2ce	sched: Drop the rq argument to sched_class::select_task_rq() In preparation of calling select_task_rq() without rq->lock held, drop the dependency on the rq argument. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110405152729.031077745@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-14 08:52:36 +02:00
Peter Zijlstra	013fdb8086	sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock Currently p->pi_lock already serializes p->sched_class, also put p->cpus_allowed and try_to_wake_up() under it, this prepares the way to do the first part of ttwu() without holding rq->lock. By having p->sched_class and p->cpus_allowed serialized by p->pi_lock, we prepare the way to call select_task_rq() without holding rq->lock. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110405152728.990364093@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-14 08:52:35 +02:00
Peter Zijlstra	fd2f4419b4	sched: Provide p->on_rq Provide a generic p->on_rq because the p->se.on_rq semantics are unfavourable for lockless wakeups but needed for sched_fair. In particular, p->on_rq is only cleared when we actually dequeue the task in schedule() and not on any random dequeue as done by things like __migrate_task() and __sched_setscheduler(). This also allows us to remove p->se usage from !sched_fair code. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152728.949545047@chello.nl	2011-04-14 08:52:35 +02:00
Peter Zijlstra	d7c01d27ab	sched: Clean up ttwu() stats Collect all ttwu() stat code into a single function and ensure it's always called for an actual wakeup (changing p->state to TASK_RUNNING). Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152728.908177058@chello.nl	2011-04-14 08:52:34 +02:00
Peter Zijlstra	893633817f	sched: Change the ttwu() success details try_to_wake_up() would only return a success when it would have to place a task on a rq; change that to every time we change p->state to TASK_RUNNING, because that's the real measure of wakeups. As a result, success is always true for the tracepoints. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152728.866866929@chello.nl	2011-04-14 08:52:34 +02:00
Peter Zijlstra	c2f7115e2e	sched: Move wq_worker_waking to the correct site wq_worker_waking_up() needs to match wq_worker_sleeping(), since the latter is only called on deactivate, move the former near activate. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/n/top-t3m7n70n9frmv4pv2n5fwmov@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-14 08:52:33 +02:00
Peter Zijlstra	c6eb3dda25	mutex: Use p->on_cpu for the adaptive spin Since we now have p->on_cpu unconditionally available, use it to re-implement mutex_spin_on_owner. Requested-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152728.826338173@chello.nl	2011-04-14 08:52:33 +02:00
Peter Zijlstra	3ca7a440da	sched: Always provide p->on_cpu Always provide p->on_cpu so that we can determine if it's on a cpu without having to lock the rq. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110405152728.785452014@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-14 08:52:32 +02:00
Ingo Molnar	a4c98f8bbe	Merge branch 'linus' into sched/locking Merge reason: Pick up this upstream commit: `6631e635c6`: block: don't flush plugged IO on forced preemtion scheduling As it modifies the scheduler and we'll queue up dependent patches. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-14 08:51:07 +02:00
Linus Torvalds	6631e635c6	block: don't flush plugged IO on forced preemtion scheduling We really only want to unplug the pending IO when the process actually goes to sleep. So move the test for flushing the plug up to the place where we actually deactivate the task - where we have properly checked for preemption and for the process really sleeping. Acked-by: Jens Axboe <jaxboe@fusionio.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-04-13 08:08:20 -07:00
Jens Axboe	94b5eb28b4	block: fixup block IO unplug trace call It was removed with the on-stack plugging, readd it and track the depth of requests added when flushing the plug. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-12 10:12:19 +02:00
Jens Axboe	d9c9783317	block: remove block_unplug_timer() trace point We no longer have an unplug timer running, so no point in keeping the trace point. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-12 10:06:33 +02:00
Shriram Rajagopalan	d419e4c0f7	fix XEN_SAVE_RESTORE Kconfig dependencies Make XEN_SAVE_RESTORE select HIBERNATE_CALLBACKS. Remove XEN_SAVE_RESTORE dependency from PM_SLEEP. Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-04-11 22:54:48 +02:00
Rafael J. Wysocki	1f112cee07	PM / Hibernate: Introduce CONFIG_HIBERNATE_CALLBACKS Xen save/restore is going to use hibernate device callbacks for quiescing devices and putting them back to normal operations and it would need to select CONFIG_HIBERNATION for this purpose. However, that also would cause the hibernate interfaces for user space to be enabled, which might confuse user space, because the Xen kernels don't support hibernation. Moreover, it would be wasteful, as it would make the Xen kernels include a substantial amount of code that they would never use. To address this issue introduce new power management Kconfig option CONFIG_HIBERNATE_CALLBACKS, such that it will only select the code that is necessary for the hibernate device callbacks to work and make CONFIG_HIBERNATION select it. Then, Xen save/restore will be able to select CONFIG_HIBERNATE_CALLBACKS without dragging the entire hibernate code along with it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Tested-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>	2011-04-11 22:54:42 +02:00
Peter Zijlstra	60495e7760	sched: Dynamic sched_domain::level Remove the SD_LV_ enum and use dynamic level assignments. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.969433965@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 14:09:32 +02:00
Peter Zijlstra	54ab4ff431	sched: Move sched domain storage into the topology list In order to remove the last dependency on the static domain levels, move the sd_data storage into the topology structure. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.924926412@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 14:09:31 +02:00
Peter Zijlstra	d069b916f7	sched: Reverse the topology list In order to get rid of static sched_domain::level assignments, reverse the topology iteration. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.876506131@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 14:09:29 +02:00
Peter Zijlstra	2c402dc3bb	sched: Unify the sched_domain build functions Since all the __build_$DOM_sched_domain() functions do pretty much the same thing, unify them. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.826347257@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 14:09:27 +02:00
Peter Zijlstra	eb7a74e6cd	sched: Stuff the sched_domain creation in a data-structure In order to make the topology construction fully dynamic, remove the still hard-coded list of possible domains and stick them in a data-structure. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.770335383@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 14:09:26 +02:00
Peter Zijlstra	d3081f52f2	sched: Create proper cpu_$DOM_mask() functions In order to unify the sched domain creation more, create proper cpu_$DOM_mask() functions for those domains that didn't already have one. Use the sched_domains_tmpmask for the weird NUMA domain span. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.717702108@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 14:09:24 +02:00
Peter Zijlstra	4cb988395d	sched: Avoid allocations in sched_domain_debug() Since we're all serialized by sched_domains_mutex we can use sched_domains_tmpmask and avoid having to do allocations. This means we can use sched_domain_debug() for cpu_attach_domain() again. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.664347467@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 14:05:00 +02:00
Peter Zijlstra	f96225fd51	sched: Create persistent sched_domains_tmpmask Since sched domain creation is fully serialized by the sched_domains_mutex we can create a single persistent tmpmask to use during domain creation. This removes the need for s_data::send_covered. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.607287405@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:23 +02:00
Peter Zijlstra	7dd04b7307	sched: Remove some dead code Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.553814623@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:22 +02:00
Peter Zijlstra	bf28b25326	sched: Remove nodemask allocation There's only one nodemask user left so remove it with a direct computation and save some memory and reduce some code-flow complexity. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.505608966@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:22 +02:00
Peter Zijlstra	3bd65a80af	sched: Simplify NODE/ALLNODES domain creation Don't treat ALLNODES/NODE differently for difference's sake. Simply always create the ALLNODES domain and let the sd_degenerate() checks kill it when it's redundant. This simplifies the code flow. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.455464579@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:21 +02:00
Peter Zijlstra	a6c75f2f8d	sched: Avoid using sd->level Don't use sd->level for identifying properties of the domain. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.350174079@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:20 +02:00
Peter Zijlstra	822ff793c3	sched: Simplify the free path some If we check the root_domain reference count we can see if it's been used or not; use this observation to simplify some of the return paths. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.298339503@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:20 +02:00
Peter Zijlstra	dce840a087	sched: Dynamically allocate sched_domain/sched_group data-structures Instead of relying on static allocations for the sched_domain and sched_group trees, dynamically allocate and RCU free them. Allocating this dynamically also allows for some build_sched_groups() simplification since we can now (like with other simplifications) rely on the sched_domain tree instead of hard-coded knowledge. One tricky thing to note is that detach_destroy_domains() needs to hold rcu_read_lock() over the entire tear-down, per-cpu is not sufficient since that can lead to partial sched_group existence (could possibly be solved by doing the tear-down backwards but this is much more robust). A consequence of the above is that we can no longer print the sched_domain debug stuff from cpu_attach_domain() since that might now run with preemption disabled (due to classic RCU etc.) and sched_domain_debug() does some GFP_KERNEL allocations. Another thing to note is that we now fully rely on normal RCU and not RCU-sched, this is because with the new and exiting RCU flavours we grew over the years BH doesn't necessarily hold off RCU-sched grace periods (-rt is known to break this). This would in fact already cause us grief since we do sched_domain/sched_group iterations from softirq context. This patch is somewhat larger than I would like it to be, but I didn't find any means of shrinking/splitting this. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.245307941@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:19 +02:00
Peter Zijlstra	a9c9a9b6bf	sched: Simplify sched_groups_power initialization Again, instead of relying on knowing the possible domains and their order, simply rely on the sched_domain tree and whatever domains are present in there to initialize the sched_group cpu_power. Note: we need to iterate the CPU mask backwards because of the cpumask_first() condition for iterating up the tree. By iterating the mask backwards we ensure all groups of a domain are set-up before starting on the parent groups that rely on its children to be completely done. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.187335414@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:19 +02:00
Peter Zijlstra	21d42ccfd6	sched: Simplify finding the lowest sched_domain Instead of relying on knowing the build order and various CONFIG_ flags simply remember the bottom most sched_domain when we created the domain hierarchy. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.134511046@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:19 +02:00
Peter Zijlstra	1cf5190254	sched: Simplify sched_group creation Instead of calling build_sched_groups() for each possible sched_domain we might have created, note that we can simply iterate the sched_domain tree and call it for each sched_domain present. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.077862519@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:18 +02:00
Peter Zijlstra	3739494e08	sched: Clean up some ALLNODES code Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.025636011@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:18 +02:00
Peter Zijlstra	cd4ea6ae39	sched: Change NODE sched_domain group creation The NODE sched_domain is 'special' in that it allocates sched_groups per CPU, instead of sharing the sched_groups between all CPUs. While this might have some benefits on large NUMA and avoid remote memory accesses when iterating the sched_groups, this does break current code that assumes sched_groups are shared between all sched_domains (since the dynamic cpu_power patches). So refactor the NODE groups to behave like all other groups. (The ALLNODES domain again shared its groups across the CPUs for some reason). If someone does measure a performance decrease due to this change we need to revisit this and come up with another way to have both dynamic cpu_power and NUMA work nice together. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122941.978111700@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:17 +02:00
Peter Zijlstra	a06dadbec5	sched: Simplify build_sched_groups() Notice that the mask being computed is the same as the domain span we just computed. By using the domain_span we can avoid some mask allocations and computations. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122941.925028189@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:17 +02:00
Peter Zijlstra	d274cb30f4	sched: Simplify ->cpu_power initialization The code in update_group_power() does what init_sched_groups_power() does and more, so remove the special init_ code and call the generic code instead. Also move the sd->span_weight initialization because update_group_power() needs it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122941.875856012@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:16 +02:00
Peter Zijlstra	c4a8849af9	sched: Remove obsolete arch_ prefixes Non weak static functions clearly are not arch specific, so remove the arch_ prefix. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122941.820460566@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 12:58:16 +02:00
Shaohua Li	f4ad9bd208	sched: Eliminate dead code from wakeup_gran() calc_delta_fair() checks NICE_0_LOAD already, delete duplicate check. Signed-off-by: Shaohua Li<shaohua.li@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Link: http://lkml.kernel.org/r/1302238389.3981.92.camel@sli10-conroe Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 11:08:55 +02:00
Ken Chen	b30aef17f7	sched: Fix erroneous all_pinned logic The scheduler load balancer has specific code to deal with cases of unbalanced system due to lots of unmovable tasks (for example because of hard CPU affinity). In those situations, it excludes the busiest CPU that has pinned tasks from load balance consideration so that it can perform a second load balance pass on the rest of the system. This all works as designed if there is only one cgroup in the system. However, when we have multiple cgroups, this logic has false positives and triggers multiple load balance passes even though there are actually no pinned tasks at all. The reason it has false positives is that the all pinned logic is deep in the lowest function of can_migrate_task() and is too low level: load_balance_fair() iterates each task group and calls balance_tasks() to migrate target load. Along the way, balance_tasks() will also set an all_pinned variable. Given that task-groups are iterated, this all_pinned variable is essentially the status of the last group in the scanning process. A task group can have a number of reasons for no load being migrated, none due to cpu affinity. However, this status bit is being propagated back up to the higher level load_balance(), which incorrectly thinks that no tasks were moved. It kicks off the all pinned logic and starts multiple passes attempting to move load onto the puller CPU. To fix this, move the all_pinned aggregation up to the iterator level. This ensures that the status is aggregated over all task-groups, not just the last one in the list. Signed-off-by: Ken Chen <kenchen@google.com> Cc: stable@kernel.org Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 11:08:54 +02:00
Ken Chen	b0432d8f16	sched: Fix sched-domain avg_load calculation In function find_busiest_group(), the sched-domain avg_load isn't calculated at all if there is a group imbalance within the domain. This will cause erroneous imbalance calculation. The reason is that calculate_imbalance() sees sds->avg_load = 0 and it will dump the entire sds->max_load into the imbalance variable, which is used later on to migrate the entire load from the busiest CPU to the puller CPU. This has two really bad effects: 1. stampede of task migration, and they won't be able to break out of the bad state because of a positive feedback loop: large load delta -> heavier load migration -> larger imbalance and the cycle goes on. 2. severe imbalance in CPU queue depth. This causes really long scheduling latency blips which badly affect applications that have tight latency requirements. The fix is to have the kernel calculate domain avg_load in both cases. This will ensure that imbalance calculation is always sensible and the target is usually half way between busiest and puller CPU. Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 11:08:54 +02:00
Stephane Eranian	e566b76ed3	perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec() There is a bug in perf_event_enable_on_exec() when cgroup events are active on a CPU: the cgroup events may be scheduled twice causing event state corruptions which eventually may lead to kernel panics. The reason is that the function needs to first schedule out the cgroup events, just like for the per-thread events. The cgroup events are scheduled back in automatically from the perf_event_context_sched_in() function. The patch also adds a WARN_ON_ONCE() in perf_cgroup_switch() to catch any bogus state. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110406005454.GA1062@quad Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-11 11:07:55 +02:00
Justin P. Mattock	6875669906	arch:Kconfig.locks Remove unused config option. Signed-off-by: Justin P. Mattock <justinmattock@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-04-10 17:01:05 +02:00
Justin P. Mattock	6eab04a876	treewide: remove extra semicolons Signed-off-by: Justin P. Mattock <justinmattock@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-04-10 17:01:05 +02:00
Randy Dunlap	f9fa0bc1fa	signal.c: fix erroneous syscall kernel-doc Fix erroneous syscall kernel-doc comments in kernel/signal.c. Reported-by: Matt Fleming <matt@console-pimps.org> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-04-08 11:05:24 -07:00
Linus Torvalds	8b9686ff4d	Merge branches 'x86-fixes-for-linus', 'sched-fixes-for-linus', 'timers-fixes-for-linus', 'irq-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86-32, fpu: Fix FPU exception handling on non-SSE systems x86, hibernate: Initialize mmu_cr4_features during boot x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change x86: visws: Fixup irq overhaul fallout * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Clean up rebalance_domains() load-balance interval calculation * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86/mrst/vrtc: Fix boot crash in mrst_rtc_init() rtc, x86/mrst/vrtc: Fix boot crash in rtc_read_alarm() * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Fix cpumask leak in __setup_irq() * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf probe: Fix listing incorrect line number with inline function perf probe: Fix to find recursively inlined function perf probe: Fix multiple --vars options behavior perf probe: Fix to remove redundant close perf probe: Fix to ensure function declared file	2011-04-07 12:12:58 -07:00
Oleg Nesterov	e46bc9b6fd	Merge branch 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into ptrace	2011-04-07 20:44:11 +02:00
Linus Torvalds	42933bac11	Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6: Fix common misspellings	2011-04-07 11:14:49 -07:00
Peter Zijlstra	49c022e657	sched: Clean up rebalance_domains() load-balance interval calculation Instead of the possible multiple-evaluation of num_online_cpus() in rebalance_domains() that Linus reported, avoid it altogether in the normal case since it's implemented with a Hamming weight function over a cpu bitmask which can be darn expensive for those with big iron. This also makes it cleaner, smaller and documents the code. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1301991265.2225.12.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-05 10:29:36 +02:00
Randy Dunlap	41c57892a2	kernel/signal.c: add kernel-doc notation to syscalls Add kernel-doc to syscalls in signal.c. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-04-04 17:51:46 -07:00
Randy Dunlap	5aba085ede	kernel/signal.c: fix typos and coding style General coding style and comment fixes; no code changes: - Use multi-line-comment coding style. - Put some function signatures completely on one line. - Hyphenate some words. - Spell Posix as POSIX. - Correct typos & spellos in some comments. - Drop trailing whitespace. - End sentences with periods. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-04-04 17:51:46 -07:00
Jason Baron	d430d3d7e6	jump label: Introduce static_branch() interface Introduce: static __always_inline bool static_branch(struct jump_label_key *key); instead of the old JUMP_LABEL(key, label) macro. In this way, jump labels become really easy to use: Define: struct jump_label_key jump_key; Can be used as: if (static_branch(&jump_key)) do unlikely code enable/disable via: jump_label_inc(&jump_key); jump_label_dec(&jump_key); that's it! For the jump labels disabled case, the static_branch() becomes an atomic_read(), and jump_label_inc()/dec() are simply atomic_inc(), atomic_dec() operations. We show testing results for this change below. Thanks to H. Peter Anvin for suggesting the 'static_branch()' construct. Since we now require a 'struct jump_label_key *key', we can store a pointer into the jump table addresses. In this way, we can enable/disable jump labels, in basically constant time. This change allows us to completely remove the previous hashtable scheme. Thanks to Peter Zijlstra for this re-write. Testing: I ran a series of 'tbench 20' runs 5 times (with reboots) for 3 configurations, where tracepoints were disabled. jump label configured in avg: 815.6 jump label not configured in (using atomic reads) avg: 800.1 jump label not configured in (regular reads) avg: 803.4 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110316212947.GA8792@redhat.com> Signed-off-by: Jason Baron <jbaron@redhat.com> Suggested-by: H. Peter Anvin <hpa@linux.intel.com> Tested-by: David Daney <ddaney@caviumnetworks.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-04 12:48:08 -04:00
Jiri Olsa	ee5e51f51b	tracing: Avoid soft lockup in trace_pipe running following commands: # enable the binary option echo 1 > ./options/bin # disable context info option echo 0 > ./options/context-info # tracing only events echo 1 > ./events/enable cat trace_pipe plus forcing system to generate many tracing events, is causing lockup (in NON preemptive kernels) inside tracing_read_pipe function. The issue is also easily reproduced by running ltp stress test. (ftrace_stress_test.sh) The reasons are: - bin/hex/raw output functions for events are set to trace_nop_print function, which prints nothing and returns TRACE_TYPE_HANDLED value - LOST EVENT trace does not handle trace_seq overflow These reasons force the while loop in tracing_read_pipe function never to break. The attached patch fixes handling of lost event trace, and changes trace_nop_print to print minimal info, which is needed for the correct tracing_read_pipe processing. v2 changes: - omit the cond_resched changes by trace_nop_print changes - WARN changed to WARN_ONCE and added info to be able to find out the culprit v3 changes: - make more accurate patch comment Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <20110325110518.GC1922@jolsa.brq.redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-04 12:18:24 -04:00
Steven Rostedt	1813dc3776	tracing: Print trace_bprintk() formats for modules too The file debugfs/tracing/printk_formats maps the addresses to the formats that are used by trace_bprintk() so that userspace tools can read the buffer and be able to decode trace_bprintk events to get the format saved when reading the ring buffer directly. This is because trace_bprintk() does not store the format into the buffer, but just the address of the format, which is hidden in the kernel memory. But currently it only exports trace_bprintk()s from the kernel core and not for modules. The modules need their formats exported as well. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-04 12:18:24 -04:00
Steven Rostedt	0588fa30db	tracing: Convert trace_printk() formats for module to const char * The trace_printk() formats for modules do not show up in the debugfs/tracing/printk_formats file. Only the formats that are for trace_printk()s that are in the kernel core. To facilitate the change to add trace_printk() formats from modules into that file as well, we need to convert the structure that holds the formats from char fmt[], into const char *fmt, and allocate them separately. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-04 12:18:24 -04:00
Linus Torvalds	148086bb64	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix rebalance interval calculation sched, doc: Beef up load balancing description sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks	2011-04-04 08:36:58 -07:00
Linus Torvalds	4da7e90e65	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix task_struct reference leak perf: Fix task context scheduling perf: mmap 512 kiB by default perf: Rebase max unprivileged mlock threshold on top of page size perf tools: Fix NO_NEWT=1 python build error perf symbols: Properly align symbol_conf.priv_size perf tools: Emit clearer message for sys_perf_event_open ENOENT return perf tools: Fixup exit path when not able to open events perf symbols: Fix vsyscall symbol lookup oprofile, x86: Allow setting EDGE/INV/CMASK for counter events	2011-04-04 08:36:40 -07:00
Richard Cochran	4352d9d44b	ntp: fix non privileged system time shifting The ADJ_SETOFFSET bit added in commit `094aa188` ("ntp: Add ADJ_SETOFFSET mode bit") also introduced a way for any user to change the system time. Sneaky or buggy calls to adjtimex() could set ADJ_OFFSET_SS_READ | ADJ_SETOFFSET which would result in a successful call to timekeeping_inject_offset(). This patch fixes the issue by adding the capability check. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-04-04 08:31:23 -07:00
Eric Paris	a3232d2fa2	capabilities: delete all CAP_INIT macros The CAP_INIT macros of INH, BSET, and EFF made sense at one point in time, but now days they aren't helping. Just open code the logic in the init_cred. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2011-04-04 10:31:16 +10:00
Eric Paris	5163b583a0	capabilities: delete unused cap_set_full unused code. Clean it up. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Signed-off-by: James Morris <jmorris@namei.org>	2011-04-04 10:31:12 +10:00
Eric Paris	ffa8e59df0	capabilities: do not drop CAP_SETPCAP from the initial task In olden' days of yore CAP_SETPCAP had special meaning for the init task. We actually have code to make sure that CAP_SETPCAP wasn't in pE of things using the init_cred. But CAP_SETPCAP isn't so special any more and we don't have a reason to special case dropping it for init or kthreads.... Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Signed-off-by: James Morris <jmorris@namei.org>	2011-04-04 10:31:09 +10:00
Eric Paris	17f60a7da1	capabilites: allow the application of capability limits to usermode helpers There is no way to limit the capabilities of usermodehelpers. This problem reared its head recently when someone complained that any user with cap_net_admin was able to load arbitrary kernel modules, even though the user didn't have cap_sys_module. The reason is that the actual load is done by a usermode helper and those always have the full cap set. This patch adds new sysctls which allow us to bound the permissions of usermode helpers. /proc/sys/kernel/usermodehelper/bset /proc/sys/kernel/usermodehelper/inheritable You must have CAP_SYS_MODULE and CAP_SETPCAP to change these (changes are &= ONLY). When the kernel launches a usermodehelper it will do so with these as the bset and pI. -v2: make globals static create spinlock to protect globals -v3: require both CAP_SETPCAP and CAP_SYS_MODULE -v4: fix the typo s/CAP_SET_PCAP/CAP_SETPCAP/ because I didn't commit Signed-off-by: Eric Paris <eparis@redhat.com> No-objection-from: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Signed-off-by: James Morris <jmorris@namei.org>	2011-04-04 10:31:04 +10:00
Oleg Nesterov	321fb56197	ptrace: ptrace_check_attach() should not do s/STOPPED/TRACED/ After "ptrace: Clean transitions between TASK_STOPPED and TRACED" `d79fdd6d96`, ptrace_check_attach() should never see a TASK_STOPPED tracee and s/STOPPED/TRACED/ is no longer legal. Add the warning. Note: ptrace_check_attach() can be greatly simplified, in particular it doesn't need tasklist. But I'd prefer another patch for that. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2011-04-04 02:11:05 +02:00
Oleg Nesterov	ee77f07592	signal: Turn SIGNAL_STOP_DEQUEUED into GROUP_STOP_DEQUEUED This patch moves SIGNAL_STOP_DEQUEUED from signal_struct->flags to task_struct->group_stop, and thus makes it per-thread. Like SIGNAL_STOP_DEQUEUED, GROUP_STOP_DEQUEUED can be false-positive after return from get_signal_to_deliver(), this is fine. The only purpose of this bit is: we can drop ->siglock after __dequeue_signal() returns the sig_kernel_stop() signal and before we call do_signal_stop(), in this case we must not miss SIGCONT if it comes in between. But, unlike SIGNAL_STOP_DEQUEUED, GROUP_STOP_DEQUEUED can not be false-positive in do_signal_stop() if multiple threads dequeue the sig_kernel_stop() signal at the same time. Consider two threads T1 and T2, SIGTTIN has a handler. - T1 dequeues SIGTSTP and sets SIGNAL_STOP_DEQUEUED, then it drops ->siglock - SIGCONT comes and clears SIGNAL_STOP_DEQUEUED, SIGTSTP should be cancelled. - T2 dequeues SIGTTIN and sets SIGNAL_STOP_DEQUEUED again. Since we have a handler we should not stop, T2 returns to usermode to run the handler. - T1 continues, calls do_signal_stop() and wrongly starts the group stop because SIGNAL_STOP_DEQUEUED was restored in between. With or without this change: - we need to do something with ptrace_signal() which can return SIGSTOP, but this needs another discussion - SIGSTOP can be lost if it races with the mt exec, will be fixed later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2011-04-04 02:11:05 +02:00
Oleg Nesterov	780006eac2	signal: do_signal_stop: Remove the unneeded task_clear_group_stop_pending() PF_EXITING or TASK_STOPPED has already called task_participate_group_stop() and cleared its ->group_stop. No need to do task_clear_group_stop_pending() when we start the new group stop. Add a small comment to explain the !task_is_stopped() check. Note that this check is not exactly right and it can lead to unnecessary stop later if the thread is TASK_PTRACED. What we need is task_participated_in_group_stop(), this will be solved later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2011-04-04 02:11:04 +02:00
Oleg Nesterov	1deac632fc	signal: prepare_signal(SIGCONT) shouldn't play with TIF_SIGPENDING prepare_signal(SIGCONT) should never set TIF_SIGPENDING or wake up the TASK_INTERRUPTIBLE threads. We are going to call complete_signal() which should pick the right thread correctly. All we need is to wake up the TASK_STOPPED threads. If the task was stopped, it can't return to usermode without taking ->siglock. Otherwise we don't care, and the spurious TIF_SIGPENDING can't be useful. The comment says: * If there is a handler for SIGCONT, we must make * sure that no thread returns to user mode before * we post the signal It is not clear what this means. Probably, "when there's only a single thread" and this continues to be true. Otherwise, even if this SIGCONT is not private, with or without this change only one thread can dequeue SIGCONT, other threads can happily return to user mode before that thread handles this signal. Note also that wake_up_state(t, __TASK_STOPPED) can't race with the task which changes its state, TASK_STOPPED state is protected by ->siglock as well. In short: when it comes to signal delivery, SIGCONT is the normal signal and does not need any special support. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2011-04-04 02:11:04 +02:00
Xiaotian Feng	4f5058c3b7	genirq: Fix cpumask leak in __setup_irq() The allocated cpumask should be freed in __setup_irq(). Signed-off-by: Xiaotian Feng <dfeng@redhat.com> LKML-Reference: <1301744375-6812-1-git-send-email-dfeng@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-04-02 21:26:20 +02:00
Anton Blanchard	c0bb9e45f3	kdump: Allow shrinking of kdump region to be overridden On ppc64 the crashkernel region almost always overlaps an area of firmware. This works fine except when using the sysfs interface to reduce the kdump region. If we free the firmware area we are guaranteed to crash. Rename free_reserved_phys_range to crash_free_reserved_phys_range and make it a weak function so we can override it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2011-04-01 16:14:30 +11:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
Peter Zijlstra	fd1edb3aa2	perf: Fix task_struct reference leak sys_perf_event_open() had an imbalance in the number of task refs it took causing memory leakage Cc: Jiri Olsa <jolsa@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@kernel.org # .37+ Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-31 13:02:56 +02:00
Frederic Weisbecker	20443384fe	perf: Rebase max unprivileged mlock threshold on top of page size Ensure we allow 512 kiB + 1 page for user control without assuming a 4096 bytes page size. Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: <stable@kernel.org> LKML-Reference: <1301535209-9679-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-31 13:02:54 +02:00
Sisir Koppaka	3436ae1298	sched: Fix rebalance interval calculation The interval for checking scheduling domains if they are due to be balanced currently depends on boot state NR_CPUS, which may not accurately reflect the number of online CPUs at the time of check. Thus replace NR_CPUS with num_online_cpus(). (ed: Should only affect those who set NR_CPUS really high, such as 4096 or so :-) Signed-off-by: Sisir Koppaka <sisir.koppaka@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <AANLkTikqHWid2Q93F5U5Qw5snJH8C5PXoa7J6=6hYO94@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-31 13:00:37 +02:00
Dario Faggioli	a51e919818	sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks sched_setscheduler() (in sched.c) is called in order of changing the scheduling policy and/or the real-time priority of a task. Thus, if we find out that neither of those are actually being modified, it is possible to return earlier and save the overhead of a full deactivate+activate cycle of the task in question. Beside that, if we have more than one SCHED_FIFO task with the same priority on the same rq (which means they share the same priority queue) having one of them changing its position in the priority queue because of a sched_setscheduler (as it happens by means of the deactivate+activate) that does not actually change the priority violates POSIX which states, for SCHED_FIFO: "If a thread whose policy or priority has been modified by pthread_setschedprio() is a running thread or is runnable, the effect on its position in the thread list depends on the direction of the modification, as follows: a. <...> b. If the priority is unchanged, the thread does not change position in the thread list. c. <...>" http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_08.html (ed: And the POSIX specification here does, briefly and somewhat unexpectedly, match what common sense tells us as well. ) Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1300971618.3960.82.camel@Palantir> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-31 13:00:34 +02:00
Thomas Gleixner	78c8982564	genirq: Remove the now obsolete config options and select statements Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-30 14:13:23 +02:00
Thomas Gleixner	353c8ed44f	genirq: Fix misnamed label in handle_edge_eoi_irq Reported-by: michael@ellerman.id.au Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linuxppc-dev@lists.ozlabs.org	2011-03-29 22:24:05 +02:00
Thomas Gleixner	851d7cf647	genirq: Remove move_*irq leftovers All users converted to new interface. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-29 14:50:32 +02:00
Thomas Gleixner	0c6f8a8b91	genirq: Remove compat code Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-29 14:48:19 +02:00
Thomas Gleixner	a6e120ed42	alpha: Use generic show_interrupts() The only subtle difference is that alpha uses ACTUAL_NR_IRQS and prints the IRQF_DISABLED flag. Change the generic implementation to deal with ACTUAL_NR_IRQS if defined. The IRQF_DISABLED printing is pointless, as we nowadays run all interrupts with irqs disabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-29 14:47:58 +02:00
Thomas Gleixner	cd22c0e44b	genirq: Fix harmless typo The late night fixup failed to convert the data type from irq_desc to irq_data, which results in a harmless but annoying warning. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-29 11:36:05 +02:00
Linus Torvalds	e5217fb8ae	Merge branches 'irq-cleanup-for-linus' and 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: vlynq: Convert irq functions * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq; Fix cleanup fallout genirq: Fix typo and remove unused variable genirq: Fix new kernel-doc warnings genirq: Add setter for AFFINITY_SET in irq_data state genirq: Provide setter inline for IRQD_IRQ_INPROGRESS genirq: Remove handle_IRQ_event arm: Ns9xxx: Remove private irq flow handler powerpc: cell: Use the core flow handler genirq: Provide edge_eoi flow handler genirq: Move INPROGRESS, MASKED and DISABLED state flags to irq_data genirq: Split irq_set_affinity() so it can be called with lock held. genirq: Add chip flag for restricting cpu_on/offline calls genirq: Add chip hooks for taking CPUs on/off line. genirq: Add irq disabled flag to irq_data state genirq: Reserve the irq when calling irq_set_chip()	2011-03-28 17:39:54 -07:00
Thomas Gleixner	0ef5ca1e1f	genirq; Fix cleanup fallout I missed the CONFIG_GENERIC_PENDING_IRQ dependency in the affinity related functions and the IRQ_LEVEL propagation into irq_data state. Did not pop up on my main test platforms. :( Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: David Daney <ddaney@caviumnetworks.com>	2011-03-29 01:41:22 +02:00
Roland Dreier	243b422af9	Relax si_code check in rt_sigqueueinfo and rt_tgsigqueueinfo Commit `da48524eb2` ("Prevent rt_sigqueueinfo and rt_tgsigqueueinfo from spoofing the signal code") made the check on si_code too strict. There are several legitimate places where glibc wants to queue a negative si_code different from SI_QUEUE: - This was first noticed with glibc's aio implementation, which wants to queue a signal with si_code SI_ASYNCIO; the current kernel causes glibc's tst-aio4 test to fail because rt_sigqueueinfo() fails with EPERM. - Further examination of the glibc source shows that getaddrinfo_a() wants to use SI_ASYNCNL (which the kernel does not even define). The timer_create() fallback code wants to queue signals with SI_TIMER. As suggested by Oleg Nesterov <oleg@redhat.com>, loosen the check to forbid only the problematic SI_TKILL case. Reported-by: Klaus Dittrich <kladit@arcor.de> Acked-by: Julien Tinnes <jln@google.com> Cc: <stable@kernel.org> Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-28 15:45:44 -07:00
Thomas Gleixner	a6aeddd1c4	genirq: Fix typo and remove unused variable Sigh, I'm overworked. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-28 20:28:56 +02:00
Randy Dunlap	30398bf6c6	genirq: Fix new kernel-doc warnings Fix new irq-related kernel-doc warnings in 2.6.38: Warning(kernel/irq/manage.c:149): No description found for parameter 'mask' Warning(kernel/irq/manage.c:149): Excess function parameter 'cpumask' description in 'irq_set_affinity' Warning(include/linux/irq.h:161): No description found for parameter 'state_use_accessors' Warning(include/linux/irq.h:161): Excess struct/union/enum/typedef member 'state_use_accessor' description in 'irq_data' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <20110318093356.b939558d.randy.dunlap@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-28 20:13:57 +02:00
Thomas Gleixner	33b054b867	genirq: Remove handle_IRQ_event Last user gone. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-28 16:55:11 +02:00
Thomas Gleixner	0521c8fbb3	genirq: Provide edge_eoi flow handler This is a replacement for the cell flow handler, which is in the way of cleanups. Must be selected to avoid general bloat. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-28 16:55:11 +02:00
Thomas Gleixner	32f4125ebf	genirq: Move INPROGRESS, MASKED and DISABLED state flags to irq_data We really need these flags for some of the interrupt chips. Move it from internal state to irq_data and provide proper accessors. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: David Daney <ddaney@caviumnetworks.com>	2011-03-28 16:55:10 +02:00
David Daney	c2d0c555c2	genirq: Split irq_set_affinity() so it can be called with lock held. The .irq_cpu_online() and .irq_cpu_offline() functions may need to adjust affinity, but they are called with the descriptor lock held. Create __irq_set_affinity_locked() which is called with the lock held. Make irq_set_affinity() just a wrapper that acquires the lock. [ tglx: Changed the argument to irq_data, added a !desc check and moved the !irq_set_affinity check where it belongs ] Signed-off-by: David Daney <ddaney@caviumnetworks.com> Cc: linux-mips@linux-mips.org Cc: ralf@linux-mips.org LKML-Reference: <1301081931-11240-4-git-send-email-ddaney@caviumnetworks.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-27 17:45:59 +02:00
Thomas Gleixner	b3d422329f	genirq: Add chip flag for restricting cpu_on/offline calls Add a flag which indicates that the on/offline callback should only be called on enabled interrupts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-27 17:45:58 +02:00
David Daney	0fdb4b259e	genirq: Add chip hooks for taking CPUs on/off line. [ tglx: Removed the enabled argument as this is now available in irq_data ] Signed-off-by: David Daney <ddaney@caviumnetworks.com> Cc: linux-mips@linux-mips.org Cc: ralf@linux-mips.org LKML-Reference: <1301081931-11240-3-git-send-email-ddaney@caviumnetworks.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-27 17:45:58 +02:00
Thomas Gleixner	801a0e9ae3	genirq: Add irq disabled flag to irq_data state Some irq_chip implementations need to know the disabled state of the interrupt in certain callbacks. Add a state flag and accessor to irq_data. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-27 17:45:58 +02:00
David Daney	d72274e589	genirq: Reserve the irq when calling irq_set_chip() The helper macros and functions like for_each_active_irq() don't work unless the irq is in the allocated_irqs set. In the case of !CONFIG_SPARSE_IRQ, instead of forcing all users of the irq infrastructure to explicitly call irq_reserve_irq(), do it for them. Signed-off-by: David Daney <ddaney@caviumnetworks.com> Cc: linux-mips@linux-mips.org Cc: ralf@linux-mips.org LKML-Reference: <1301081931-11240-2-git-send-email-ddaney@caviumnetworks.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-27 17:45:58 +02:00
Linus Torvalds	16c29dafcc	Merge branch 'syscore' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'syscore' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: Introduce ARCH_NO_SYSDEV_OPS config option (v2) cpufreq: Use syscore_ops for boot CPU suspend/resume (v2) KVM: Use syscore_ops instead of sysdev class and sysdev PCI / Intel IOMMU: Use syscore_ops instead of sysdev class and sysdev timekeeping: Use syscore_ops instead of sysdev class and sysdev x86: Use syscore_ops instead of sysdev classes and sysdevs	2011-03-25 21:07:59 -07:00
Linus Torvalds	95e14ed7fc	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kdb: add usage string of 'per_cpu' command kgdb,x86_64: fix compile warning found with sparse kdb: code cleanup to use macro instead of value kgdboc,kgdbts: strlen() doesn't count the terminator	2011-03-25 21:04:56 -07:00
Linus Torvalds	0dd61be7ec	Merge branch 'irq-cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (23 commits) genirq: Expand generic show_interrupts() gpio: Fold irq_set_chip/irq_set_handler to irq_set_chip_and_handler gpio: Cleanup genirq namespace arm: ep93xx: Add basic interrupt info arm/gpio: Remove three copies of broken and racy debug code xtensa: Use generic show_interrupts() xtensa: Convert genirq namespace xtensa: Use generic IRQ Kconfig and set GENERIC_HARDIRQS_NO_DEPRECATED xtensa: Convert s6000 gpio irq_chip to new functions xtensa: Convert main irq_chip to new functions um: Use generic show_interrupts() um: Convert genirq namespace m32r: Use generic show_interrupts() m32r: Convert genirq namespace h8300: Use generic show_interrupts() h8300: Convert genirq namespace avr32: Cleanup eic_set_irq_type() avr32: Use generic show_interrupts() avr: Cleanup genirq namespace avr32: Use generic IRQ config, enable GENERIC_HARDIRQS_NO_DEPRECATED ... Fix up trivial conflict in drivers/gpio/timbgpio.c	2011-03-25 20:24:05 -07:00
Linus Torvalds	8dd90265ac	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched, doc: Update sched-design-CFS.txt sched: Remove unused 'rq' variable and cpu_rq() call from alloc_fair_sched_group() sched.h: Fix a typo ("its") sched: Fix yield_to kernel-doc	2011-03-25 17:59:38 -07:00
Linus Torvalds	2a20b02c05	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf, x86: Complain louder about BIOSen corrupting CPU/PMU state and continue perf, x86: P4 PMU - Read proper MSR register to catch unflagged overflows perf symbols: Look at .dynsym again if .symtab not found perf build-id: Add quirk to deal with perf.data file format breakage perf session: Pass evsel in event_ops->sample() perf: Better fit max unprivileged mlock pages for tools needs perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx() perf top: Fix uninitialized 'counter' variable tracing: Fix set_ftrace_filter probe function display perf, x86: Fix Intel fixed counters base initialization	2011-03-25 17:53:09 -07:00
Linus Torvalds	839767e79e	Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Provide locked setter for chip, handler, name genirq: Provide a lockdep helper genirq; Remove the last leftovers of the old sparse irq code	2011-03-25 17:52:53 -07:00
Linus Torvalds	94df491c4a	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Fix WARN_ON() test for UP WARN_ON_SMP(): Allow use in if() statements on UP x86, dumpstack: Use %pB format specifier for stack trace vsprintf: Introduce %pB format specifier lockdep: Remove unused 'factor' variable from lockdep_stats_show()	2011-03-25 17:52:22 -07:00
Namhyung Kim	0d3db28dae	kdb: add usage string of 'per_cpu' command Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2011-03-25 16:37:31 -05:00
Jovi Zhang	27029c339b	kdb: code cleanup to use macro instead of value It's better to use macro KDB_BASE_CMD_MAX instead of 50 Signed-off-by: Jovi Zhang <bookjovi@gmail.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2011-03-25 16:37:30 -05:00
Thomas Gleixner	ab7798ffcf	genirq: Expand generic show_interrupts() Some archs want to print extra information for certain irq_chips which is per irq and not per chip. Allow them to provide a chip callback to print the chip name and the extra information. PowerPC wants to print the LEVEL/EDGE type information. Make it configurable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-25 17:04:20 +01:00
Steven Rostedt	2909620217	futex: Fix WARN_ON() test for UP An update of the futex code had a WARN_ON(!spin_is_locked(q->lock_ptr)). But on UP, spin_is_locked() is always false, so it will trigger this warning and, even worse, exit the function without doing the necessary work. Converting this to a WARN_ON_SMP() fixes the problem. Reported-by: Richard Weinberger <richard@nod.at> Tested-by: Richard Weinberger <richard@nod.at> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Darren Hart <dvhart@linux.intel.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <20110317192208.682654502@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-25 11:32:11 +01:00
Tejun Heo	0415b00d17	percpu: Always align percpu output section to PAGE_SIZE Percpu allocator honors alignment requests up to PAGE_SIZE and both the percpu addresses in the percpu address space and the translated kernel addresses should be aligned accordingly. The calculation of the former depends on the alignment of percpu output section in the kernel image. The linker script macros PERCPU_VADDR() and PERCPU() are used to define this output section and the latter takes @align parameter. Several architectures are using @align smaller than PAGE_SIZE, breaking percpu memory alignment. This patch removes @align parameter from PERCPU(), renames it to PERCPU_SECTION() and makes it always align to PAGE_SIZE. While at it, add PCPU_SETUP_BUG_ON() checks such that alignment problems are reliably detected and remove percpu alignment comment recently added in workqueue.c as the condition would trigger BUG way before reaching there. For um, this patch raises the alignment of percpu area. As the area is in .init, there shouldn't be any noticeable difference. This problem was discovered by David Howells while debugging boot failure on mn10300. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Mike Frysinger <vapier@gentoo.org> Cc: uclinux-dist-devel@blackfin.uclinux.org Cc: David Howells <dhowells@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: user-mode-linux-devel@lists.sourceforge.net	2011-03-24 18:50:09 +01:00
Linus Torvalds	6c51038900	Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}	2011-03-24 10:16:26 -07:00
Linus Torvalds	3dab04e697	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-mn10300 * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-mn10300: MN10300: gcc 4.6 vs am33 inline assembly MN10300: Deprecate gdbstub MN10300: Allow KGDB to use the MN10300 serial ports MN10300: Emulate single stepping in KGDB on MN10300 MN10300: Generalise kernel debugger kernel halt, reboot or power off hook KGDB: Notify GDB of machine halt, reboot or power off MN10300: Use KGDB MN10300: Create generic kernel debugger hooks MN10300: Create general kernel debugger cache flushing MN10300: Introduce a general config option for kernel debugger hooks MN10300: The icache invalidate functions should disable the icache first MN10300: gdbstub: Restrict single-stepping to non-preemptable non-SMP configs	2011-03-24 10:07:50 -07:00
Namhyung Kim	0f77a8d378	vsprintf: Introduce %pB format specifier The %pB format specifier is for stack backtrace. Its handler sprint_backtrace() does symbol lookup using (address-1) to ensure the address will not point outside of the function. If there is a tail-call to a function marked "noreturn", gcc optimizes out the code after the call, which causes the saved return address to point outside of the function (i.e. at the start of the next function) and so pollutes the call trace somewhat. This patch adds the %pB printk mechanism that allows architecture call-trace printout functions to improve backtrace printouts. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-arch@vger.kernel.org LKML-Reference: <1300934550-21394-1-git-send-email-namhyung@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-24 08:36:10 +01:00
Linus Torvalds	b81a618dcd	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: deal with races in /proc//{syscall,stack,personality} proc: enable writing to /proc/pid/mem proc: make check_mem_permission() return an mm_struct on success proc: hold cred_guard_mutex in check_mem_permission() proc: disable mem_write after exec mm: implement access_remote_vm mm: factor out main logic of access_process_vm mm: use mm_struct to resolve gate vma's in __get_user_pages mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm mm: arch: make in_gate_area take an mm_struct instead of a task_struct mm: arch: make get_gate_vma take an mm_struct instead of a task_struct x86: mark associated mm when running a task in 32 bit compatibility mode x86: add context tag to mark mm when running a task in 32-bit compatibility mode auxv: require the target to be tracable (or yourself) close race in /proc//environ report errors in /proc//map* sanely pagemap: close races with suid execve make sessionid permissions in /proc//task/ match those in /proc/* fix leaks in path_lookupat() Fix up trivial conflicts in fs/proc/base.c	2011-03-23 20:51:42 -07:00
Olaf Hering	93a72052be	crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn The Xen PV drivers in a crashed HVM guest can not connect to the dom0 backend drivers because both frontend and backend drivers are still in connected state. To run the connection reset function only in case of a crashdump, the is_kdump_kernel() function needs to be available for the PV driver modules. Consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn into kernel/crash_dump.c Also export elfcorehdr_addr to make is_kdump_kernel() usable for modules. Leave 'elfcorehdr' as early_param(). This changes powerpc from __setup() to early_param(). It adds an address range check from x86 also on ia64 and powerpc. [akpm@linux-foundation.org: additional #includes] [akpm@linux-foundation.org: remove elfcorehdr_addr export] [akpm@linux-foundation.org: fix for Tejun's mm/nobootmem.c changes] Signed-off-by: Olaf Hering <olaf@aepfle.de> Cc: Russell King <rmk@arm.linux.org.uk> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:19 -07:00
Mandeep Singh Baines	f9b182e24e	taskstats: use appropriate printk priority level printk()s without a priority level default to KERN_WARNING. To reduce noise at KERN_WARNING, this patch sets the priority level appropriately for unleveled printk()s. This should be useful to folks that look at dmesg warnings closely. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:14 -07:00
Serge E. Hallyn	b0e77598f8	userns: user namespaces: convert several capable() calls CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(), because the resource comes from current's own ipc namespace. setuid/setgid are to uids in own namespace, so again checks can be against current_user_ns(). Changelog: Jan 11: Use task_ns_capable() in place of sched_capable(). Jan 11: Use nsown_capable() as suggested by Bastian Blank. Jan 11: Clarify (hopefully) some logic in futex and sched.c Feb 15: use ns_capable for ipc, not nsown_capable Feb 23: let copy_ipcs handle setting ipc_ns->user_ns Feb 23: pass ns down rather than taking it from current [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:08 -07:00
Serge E. Hallyn	b515498f5b	userns: add a user namespace owner of ipc ns Changelog: Feb 15: Don't set new ipc->user_ns if we didn't create a new ipc_ns. Feb 23: Move extern declaration to ipc_namespace.h, and group fwd declarations at top. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:07 -07:00
Serge E. Hallyn	fc832ad364	userns: user namespaces: convert all capable checks in kernel/sys.c This allows setuid/setgid in containers. It also fixes some corner cases where kernel logic foregoes capability checks when uids are equivalent. The latter will need to be done throughout the whole kernel. Changelog: Jan 11: Use nsown_capable() as suggested by Bastian Blank. Jan 11: Fix logic errors in uid checks pointed out by Bastian. Feb 15: allow prlimit to current (was regression in previous version) Feb 23: remove debugging printks, uninline set_one_prio_perm and make it bool, and document its return value. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:06 -07:00
Serge E. Hallyn	3263245de4	userns: make has_capability* into real functions So we can let type safety keep things sane, and as a bonus we can remove the declaration of init_user_ns in capability.h. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:06 -07:00
Serge E. Hallyn	8409cca705	userns: allow ptrace from non-init user namespaces ptrace is allowed to tasks in the same user namespace according to the usual rules (i.e. the same rules as for two tasks in the init user namespace). ptrace is also allowed to a user namespace to which the current task has the CAP_SYS_PTRACE capability. Changelog: Dec 31: Address feedback by Eric: . Correct ptrace uid check . Rename may_ptrace_ns to ptrace_capable . Also fix the cap_ptrace checks. Jan 1: Use const cred struct Jan 11: use task_ns_capable() in place of ptrace_capable(). Feb 23: same_or_ancestore_user_ns() was not an appropriate check to constrain cap_issubset. Rather, cap_issubset() only is meaningful when both capsets are in the same user_ns. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:05 -07:00
Serge E. Hallyn	39fd33933b	userns: allow killing tasks in your own or child userns Changelog: Dec 8: Fixed bug in my check_kill_permission pointed out by Eric Biederman. Dec 13: Apply Eric's suggestion to pass target task into kill_ok_by_cred() for clarity Dec 31: address comment by Eric Biederman: don't need cred/tcred in check_kill_permission. Jan 1: use const cred struct. Jan 11: Per Bastian Blank's advice, clean up kill_ok_by_cred(). Feb 16: kill_ok_by_cred: fix bad parentheses Feb 23: per akpm, let compiler inline kill_ok_by_cred Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:04 -07:00
Serge E. Hallyn	bb96a6f50b	userns: allow sethostname in a container Changelog: Feb 23: let clone_uts_ns() handle setting uts->user_ns To do so we need to pass in the task_struct who'll get the utsname, so we can get its user_ns. Feb 23: As per Oleg's comment, just pass in tsk, instead of two of its members. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:03 -07:00
Serge E. Hallyn	3486740a4f	userns: security: make capabilities relative to the user namespace - Introduce ns_capable to test for a capability in a non-default user namespace. - Teach cap_capable to handle capabilities in a non-default user namespace. The motivation is to get to the unprivileged creation of new namespaces. It looks like this gets us 90% of the way there, with only potential uid confusion issues left. I still need to handle getting all caps after creation but otherwise I think I have a good starter patch that achieves all of your goals. Changelog: 11/05/2010: [serge] add apparmor 12/14/2010: [serge] fix capabilities to created user namespaces Without this, if user serge creates a user_ns, he won't have capabilities to the user_ns he created. This is because we were first checking whether his effective caps had the caps he needed and returning -EPERM if not, and THEN checking whether he was the creator. Reverse those checks. 12/16/2010: [serge] security_real_capable needs ns argument in !security case 01/11/2011: [serge] add task_ns_capable helper 01/11/2011: [serge] add nsown_capable() helper per Bastian Blank suggestion 02/16/2011: [serge] fix a logic bug: the root user is always creator of init_user_ns, but should not always have capabilities to it! Fix the check in cap_capable(). 02/21/2011: Add the required user_ns parameter to security_capable, fixing a compile failure. 02/23/2011: Convert some macros to functions as per akpm comments. Some couldn't be converted because we can't easily forward-declare them (they are inline if !SECURITY, extern if SECURITY). Add a current_user_ns function so we can use it in capability.h without #including cred.h. Move all forward declarations together to the top of the #ifdef __KERNEL__ section, and use kernel-doc format. 02/23/2011: Per dhowells, clean up comment in cap_capable(). 02/23/2011: Per akpm, remove unreachable 'return -EPERM' in cap_capable. (Original written and signed off by Eric; latest, modified version acked by him) [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: export current_user_ns() for ecryptfs] [serge.hallyn@canonical.com: remove unneeded extra argument in selinux's task_has_capability] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:02 -07:00
Serge E. Hallyn	59607db367	userns: add a user_namespace as creator/owner of uts_namespace The expected course of development for user namespaces targeted capabilities is laid out at https://wiki.ubuntu.com/UserNamespace. Goals: - Make it safe for an unprivileged user to unshare namespaces. They will be privileged with respect to the new namespace, but this should only include resources which the unprivileged user already owns. - Provide separate limits and accounting for userids in different namespaces. Status: Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID capabilities. What this gets you is a whole new set of userids, meaning that user 500 will have a different 'struct user' in your namespace than in other namespaces. So any accounting information stored in struct user will be unique to your namespace. However, throughout the kernel there are checks which - simply check for a capability. Since root in a child namespace has all capabilities, this means that a child namespace is not constrained. - simply compare uid1 == uid2. Since these are the integer uids, uid 500 in namespace 1 will be said to be equal to uid 500 in namespace 2. As a result, the lxc implementation at lxc.sf.net does not use user namespaces. This is actually helpful because it leaves us free to develop user namespaces in such a way that, for some time, user namespaces may not be useful. Bugs aside, this patchset is supposed to not at all affect systems which are not actively using user namespaces, and only restrict what tasks in child user namespace can do. They begin to limit privilege to a user namespace, so that root in a container cannot kill or ptrace tasks in the parent user namespace, and can only get world access rights to files. Since all files currently belong to the initial user namespace, that means that child user namespaces can only get world access rights to all files. While this temporarily makes user namespaces bad for system containers, it starts to get useful for some sandboxing. I've run the 'runltplite.sh' with and without this patchset and found no difference. This patch: copy_process() handles CLONE_NEWUSER before the rest of the namespaces. So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace will have the new user namespace as its owner. That is what we want, since we want root in that new userns to be able to have privilege over it. Changelog: Feb 15: don't set uts_ns->user_ns if we didn't create a new uts_ns. Feb 23: Move extern init_user_ns declaration from init/version.c to utsname.h. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:59 -07:00
Eric W. Biederman	4308eebbeb	pidns: call pid_ns_prepare_proc() from create_pid_namespace() Reorganize proc_get_sb() so it can be called before the struct pid of the first process is allocated. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:58 -07:00
Eric W. Biederman	45a68628d3	pid: remove the child_reaper special case in init/main.c This patchset is a cleanup and a preparation to unshare the pid namespace. These prerequisites prepare for Eric's patchset to give a file descriptor to a namespace and join an existing namespace. This patch: It turns out that the existing assignment in copy_process of the child_reaper can handle the initial assignment of child_reaper we just need to generalize the test in kernel/fork.c Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:57 -07:00
Richard Weinberger	bfdc0b497f	sysctl: restrict write access to dmesg_restrict When dmesg_restrict is set to 1 CAP_SYS_ADMIN is needed to read the kernel ring buffer. But a root user without CAP_SYS_ADMIN is able to reset dmesg_restrict to 0. This is an issue when e.g. LXC (Linux Containers) are used and complete user space is running without CAP_SYS_ADMIN. A unprivileged and jailed root user can bypass the dmesg_restrict protection. With this patch writing to dmesg_restrict is only allowed when root has CAP_SYS_ADMIN. Signed-off-by: Richard Weinberger <richard@nod.at> Acked-by: Dan Rosenberg <drosenberg@vsecurity.com> Acked-by: Serge E. Hallyn <serge@hallyn.com> Cc: Eric Paris <eparis@redhat.com> Cc: Kees Cook <kees.cook@canonical.com> Cc: James Morris <jmorris@namei.org> Cc: Eugene Teo <eugeneteo@kernel.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:54 -07:00
Petr Holasek	cb16e95fa2	sysctl: add some missing input constraint checks Add boundaries of allowed input ranges for: dirty_expire_centisecs, drop_caches, overcommit_memory, page-cluster and panic_on_oom. Signed-off-by: Petr Holasek <pholasek@redhat.com> Acked-by: Dave Young <hidave.darkstar@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:51 -07:00
Denis Kirjanov	256c53a651	sysctl_check: drop dead code Drop dead code. Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:51 -07:00
Denis Kirjanov	814ecf6e5b	sysctl_check: drop table->procname checks Since the for loop checks for the table->procname drop useless table->procname checks inside the loop body Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:50 -07:00
Li Zefan	523fb486bf	cpuset: hold callback_mutex in cpuset_post_clone() Changing cpuset->mems/cpuset->cpus should be protected under callback_mutex. cpuset_clone() doesn't follow this rule. It's ok because it's called when creating and initializing a cgroup, but we'd better hold the lock to avoid subtle breakage in the future. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:35 -07:00
Li Zefan	ee24d37977	cpuset: fix unchecked calls to NODEMASK_ALLOC() Those functions that use NODEMASK_ALLOC() can't propagate errno to users, but will fail silently. Fix it by using a static nodemask_t variable for each function, and those variables are protected by cgroup_mutex; [akpm@linux-foundation.org: fix comment spelling, strengthen cgroup_lock comment] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:35 -07:00
Li Zefan	c8163ca8af	cpuset: remove unneeded NODEMASK_ALLOC() in cpuset_attach() oldcs->mems_allowed is not modified during cpuset_attach(), so we don't have to copy it to a buffer allocated by NODEMASK_ALLOC(). Just pass it to cpuset_migrate_mm(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:34 -07:00
Li Zefan	9303e0c481	cpuset: remove unneeded NODEMASK_ALLOC() in cpuset_sprintf_memlist() It's not necessary to copy cpuset->mems_allowed to a buffer allocated by NODEMASK_ALLOC(). Just pass it to nodelist_scnprintf(). As spotted by Paul, a side effect is we fix a bug that the function can return -ENOMEM but the caller doesn't expect negative return value. Therefore change the return value of cpuset_sprintf_cpulist() and cpuset_sprintf_memlist() from int to size_t. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:34 -07:00
Johannes Weiner	6b3ae58efc	memcg: remove direct page_cgroup-to-page pointer In struct page_cgroup, we have a full word for flags but only a few are reserved. Use the remaining upper bits to encode, depending on configuration, the node or the section, to enable page_cgroup-to-page lookups without a direct pointer. This saves a full word for every page in a system with memory cgroups enabled. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:28 -07:00
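The idea in the memcg entry above (storing a node or section id in the unused upper bits of the flags word instead of keeping a separate pointer) can be sketched in plain C. This is only an illustration under stated assumptions: the struct name `pc_sketch`, the helpers `pc_set_id()` and `pc_get_id()`, and the 10-bit id width are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical layout: low bits hold state flags, the top ID_BITS hold an id. */
#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define ID_BITS   10UL                      /* assumed width, illustration only */
#define ID_SHIFT  (WORD_BITS - ID_BITS)
#define ID_MASK   ((1UL << ID_BITS) - 1)

struct pc_sketch {
    unsigned long flags;                    /* state flags plus encoded id */
};

/* Store id in the upper bits without disturbing the low flag bits. */
static inline void pc_set_id(struct pc_sketch *pc, unsigned long id)
{
    assert(id <= ID_MASK);                  /* id must fit in the reserved bits */
    pc->flags &= ~(ID_MASK << ID_SHIFT);    /* clear any previous id */
    pc->flags |= id << ID_SHIFT;
}

/* Recover the id; a real lookup would then index into that node or section. */
static inline unsigned long pc_get_id(const struct pc_sketch *pc)
{
    return (pc->flags >> ID_SHIFT) & ID_MASK;
}
```

Because the id shares the word with the flags, no extra per-page field is needed, which is where the one-word saving described in the commit message comes from.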
KAMEZAWA Hiroyuki	6c191cd01a	memcg: res_counter_read_u64(): fix potential races on 32-bit machines res_counter_read_u64 reads u64 value without lock. It's dangerous in a 32bit environment. Add locking. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:22 -07:00
Rafael J. Wysocki	e1a85b2c51	timekeeping: Use syscore_ops instead of sysdev class and sysdev The timekeeping subsystem uses a sysdev class and a sysdev for executing timekeeping_suspend() after interrupts have been turned off on the boot CPU (during system suspend) and for executing timekeeping_resume() before turning on interrupts on the boot CPU (during system resume). However, since both of these functions ignore their arguments, the entire mechanism may be replaced with a struct syscore_ops object which is simpler. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-23 22:16:04 +01:00
Stephen Wilson	cae5d39032	mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm Now that gate vma's are referenced with respect to a particular mm and not a particular task it only makes sense to propagate the change to this predicate as well. Signed-off-by: Stephen Wilson <wilsons@start.ca> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:55 -04:00
Frederic Weisbecker	880f573184	perf: Better fit max unprivileged mlock pages for tools needs The maximum kilobytes of locked memory that an unprivileged user can reserve is of 512 kB = 128 pages by default, scaled to the number of onlined CPUs, which fits well with the tools that use 128 data pages by default. However tools actually use 129 pages, because they need one more for the user control page. Thus the default mlock threshold is not sufficient for the default tools needs and we always end up to evaluate the constant mlock rlimit policy, which doesn't have this scaling with the number of online CPUs. Hence, on systems that have more than 16 CPUs, we overlap the rlimit threshold and fail to mmap: $ perf record ls Error: failed to mmap with 1 (Operation not permitted) Just increase the max unprivileged mlock threshold by one page so that it supports well perf tools even after 16 CPUs. Reported-by: Han Pingtian <phan@redhat.com> Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: Stable <stable@kernel.org> LKML-Reference: <1300904979-5508-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-23 20:57:04 +01:00
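The arithmetic behind the "more than 16 CPUs" remark in the perf entry above can be checked with a simplified model. The sketch assumes 4 kB pages and a 64 kB default RLIMIT_MEMLOCK, neither of which is stated in the commit message, and the helper names are hypothetical. It only models the idea that whatever exceeds the per-CPU perf allowance is charged against the fixed rlimit.

```c
#include <assert.h>

/* Assumed values for illustration: 4 kB pages, 64 kB RLIMIT_MEMLOCK. */
enum {
    PAGE_KB           = 4,
    DATA_PAGES        = 128,   /* default data pages used by the tools */
    CONTROL_PAGES     = 1,     /* the extra user control page */
    RLIMIT_MEMLOCK_KB = 64
};

/* kB per CPU that do not fit into the per-CPU perf mlock allowance. */
static inline long excess_kb_per_cpu(long perf_mlock_kb)
{
    long need_kb = (long)(DATA_PAGES + CONTROL_PAGES) * PAGE_KB;
    long excess = need_kb - perf_mlock_kb;
    return excess > 0 ? excess : 0;
}

/* Nonzero if buffers on ncpus CPUs fit once the excess hits the rlimit. */
static inline int perf_mmap_fits(long perf_mlock_kb, int ncpus)
{
    assert(ncpus > 0);
    return excess_kb_per_cpu(perf_mlock_kb) * ncpus <= RLIMIT_MEMLOCK_KB;
}
```

Under these assumptions a 512 kB allowance leaves 4 kB per CPU uncovered, so the 64 kB rlimit is exhausted at 16 CPUs and the mmap fails on the 17th. Raising the allowance by one page (516 kB) removes the excess entirely.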
Thomas Gleixner	3b90389128	genirq; Remove the last leftovers of the old sparse irq code All users converted. Get rid of it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-23 20:22:06 +01:00
Stephane Eranian	68cacd2916	perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx() This patch solves a stale pointer problem in update_cgrp_time_from_cpuctx(). The cpuctx->cgrp was not cleared on all possible event exit paths, including: close() perf_release() perf_release_kernel() list_del_event() This patch fixes list_del_event() to clear cpuctx->cgrp when there are no cgroup events left in the context. [ This second version makes the code compile when CONFIG_CGROUP_PERF is not enabled. We unconditionally define perf_cpu_context->cgrp. ] Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: perfmon2-devel@lists.sf.net Cc: paulus@samba.org Cc: davem@davemloft.net LKML-Reference: <20110323150306.GA1580@quad> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-23 16:07:22 +01:00
Borislav Petkov	1232d6132a	sched, doc: Update sched-design-CFS.txt Correct ->dequeue_tree() thinko into sched_class->dequeue_task and drop all references to ->task_new() since it is obviously gone. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <1300815978-16618-1-git-send-email-bp@amd64.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-23 14:09:41 +01:00
Sergey Senozhatsky	dec2960827	lockdep: Remove unused 'factor' variable from lockdep_stats_show() Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110323123828.GB4244@swordfish.minsk.epam.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-23 13:54:47 +01:00
Sergey Senozhatsky	20dd674071	sched: Remove unused 'rq' variable and cpu_rq() call from alloc_fair_sched_group() Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110323111722.GA4244@swordfish.minsk.epam.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-23 13:27:58 +01:00
Tejun Heo	244056f9db	job control: Don't send duplicate job control stop notification while ptraced Just as group_exit_code shouldn't be generated when a PTRACE_CONT'd task re-enters job control stop, notification for the event should be suppressed too. The logic is the same as the group_exit_code generation suppression in do_signal_stop(), if SIGNAL_STOP_STOPPED is already set, the task is re-entering job control stop without intervening SIGCONT and the notifications should be suppressed. Test case follows. #include <stdio.h> #include <unistd.h> #include <signal.h> #include <time.h> #include <sys/ptrace.h> #include <sys/wait.h> static const struct timespec ts100ms = { .tv_nsec = 100000000 }; static pid_t tracee, tracer; static const char *pid_who(pid_t pid) { return pid == tracee ? "tracee" : (pid == tracer ? "tracer" : "mommy "); } static void sigchld_sigaction(int signo, siginfo_t *si, void *ucxt) { printf("%s: SIG status=%02d code=%02d (%s)\n", pid_who(getpid()), si->si_status, si->si_code, pid_who(si->si_pid)); } int main(void) { const struct sigaction chld_sa = { .sa_sigaction = sigchld_sigaction, .sa_flags = SA_SIGINFO|SA_RESTART }; siginfo_t si; sigaction(SIGCHLD, &chld_sa, NULL); tracee = fork(); if (!tracee) { tracee = getpid(); while (1) pause(); } kill(tracee, SIGSTOP); waitid(P_PID, tracee, &si, WSTOPPED); tracer = fork(); if (!tracer) { tracer = getpid(); ptrace(PTRACE_ATTACH, tracee, NULL, NULL); waitid(P_PID, tracee, &si, WSTOPPED); ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status); waitid(P_PID, tracee, &si, WSTOPPED); ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status); waitid(P_PID, tracee, &si, WSTOPPED); printf("tracer: detaching\n"); ptrace(PTRACE_DETACH, tracee, NULL, NULL); return 0; } while (1) pause(); return 0; } Before the patch, the parent gets the second notification for the tracee after the tracer detaches. 
si_status is zero because group_exit_code is not set by the group stop completion which triggered this notification. mommy : SIG status=19 code=05 (tracee) tracer: SIG status=00 code=05 (tracee) tracer: SIG status=19 code=04 (tracee) tracer: SIG status=00 code=05 (tracee) tracer: detaching mommy : SIG status=00 code=05 (tracee) mommy : SIG status=00 code=01 (tracer) ^C After the patch, the duplicate notification is gone. mommy : SIG status=19 code=05 (tracee) tracer: SIG status=00 code=05 (tracee) tracer: SIG status=19 code=04 (tracee) tracer: SIG status=00 code=05 (tracee) tracer: detaching mommy : SIG status=00 code=01 (tracer) ^C Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>	2011-03-23 10:37:01 +01:00
Tejun Heo	ceb6bd67f9	job control: Notify the real parent of job control events regardless of ptrace With recent changes, job control and ptrace stopped states are properly separated and accessible to the real parent and the ptracer respectively; however, notifications of job control stopped/continued events to the real parent while ptraced are still missing. A ptracee participates in group stop in ptrace_stop() but the completion isn't notified. If participation results in completion of group stop, notify the real parent of the event. The ptrace and group stops are separate and can be handled as such. However, when the real parent and the ptracer are in the same thread group, only the ptrace stop event is visible through wait(2) and the duplicate notifications are different from the current behavior and are confusing. Suppress group stop notification in such cases. The continued state is shared between the real parent and the ptracer but is only meaningful to the real parent. Always notify the real parent and notify the ptracer too for backward compatibility. Similar to stop notification, if the real parent is the ptracer, suppress a duplicate notification. Test case follows. 
#include <stdio.h> #include <unistd.h> #include <time.h> #include <errno.h> #include <sys/types.h> #include <sys/ptrace.h> #include <sys/wait.h> int main(void) { const struct timespec ts100ms = { .tv_nsec = 100000000 }; pid_t tracee, tracer; siginfo_t si; int i; tracee = fork(); if (tracee == 0) { while (1) { printf("tracee: SIGSTOP\n"); raise(SIGSTOP); nanosleep(&ts100ms, NULL); printf("tracee: SIGCONT\n"); raise(SIGCONT); nanosleep(&ts100ms, NULL); } } waitid(P_PID, tracee, &si, WSTOPPED | WNOHANG | WNOWAIT); tracer = fork(); if (tracer == 0) { nanosleep(&ts100ms, NULL); ptrace(PTRACE_ATTACH, tracee, NULL, NULL); for (i = 0; i < 11; i++) { si.si_pid = 0; waitid(P_PID, tracee, &si, WSTOPPED); if (si.si_pid && si.si_code == CLD_TRAPPED) ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status); } printf("tracer: EXITING\n"); return 0; } while (1) { si.si_pid = 0; waitid(P_PID, tracee, &si, WSTOPPED | WCONTINUED | WEXITED); if (si.si_pid) printf("mommy : WAIT status=%02d code=%02d\n", si.si_status, si.si_code); } return 0; } Before this patch, while ptraced, the real parent doesn't get notifications for job control events, so although it can access those events, the later waitid(2) call never wakes up. tracee: SIGSTOP mommy : WAIT status=19 code=05 tracee: SIGCONT tracee: SIGSTOP tracee: SIGCONT tracee: SIGSTOP tracee: SIGCONT tracee: SIGSTOP tracer: EXITING mommy : WAIT status=19 code=05 ^C After this patch, it works as expected. tracee: SIGSTOP mommy : WAIT status=19 code=05 tracee: SIGCONT mommy : WAIT status=18 code=06 tracee: SIGSTOP mommy : WAIT status=19 code=05 tracee: SIGCONT mommy : WAIT status=18 code=06 tracee: SIGSTOP mommy : WAIT status=19 code=05 tracee: SIGCONT mommy : WAIT status=18 code=06 tracee: SIGSTOP tracer: EXITING mommy : WAIT status=19 code=05 ^C -v2: Oleg pointed out that * Group stop notification to the real parent should also happen when ptracer detach races with ptrace_stop(). 
* real_parent_is_ptracer() should be testing thread group equality not the task itself as wait(2) and stop/cont notifications are normally thread-group wide. Both issues are fixed accordingly. -v3: real_parent_is_ptracer() updated to test child->real_parent instead of child->group_leader->real_parent per Oleg's suggestion. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>	2011-03-23 10:37:01 +01:00
Tejun Heo	62bcf9d992	job control: Job control stop notifications should always go to the real parent The stopped notifications in do_signal_stop() and exit_signals() are always for the completion of job control. The one in do_signal_stop() may be delivered to the ptracer if PTRACE_ATTACH races with notification and the one in exit_signals() if task exits while ptraced. In both cases, the notifications are meaningless and confusing to the ptracer as it never accesses the group stop state while the real parent would miss notifications for the events it is watching. Make sure these notifications always go to the real parent by calling do_notify_parent_cld_stop() with %false @for_ptrace. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>	2011-03-23 10:37:01 +01:00
Tejun Heo	75b95953a5	job control: Add @for_ptrace to do_notify_parent_cldstop() Currently, do_notify_parent_cldstop() determines whether the notification is for the real parent or ptracer. Move the decision to the caller by adding @for_ptrace parameter to do_notify_parent_cldstop(). All the callers are updated to pass task_ptrace(target_task), so this patch doesn't cause any behavior difference. While at it, add function comment to do_notify_parent_cldstop(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>	2011-03-23 10:37:01 +01:00
Tejun Heo	45cb24a1da	job control: Allow access to job control events through ptracees Currently a real parent can't access job control stopped/continued events through a ptraced child. This utterly breaks job control when the children are ptraced. For example, if a program is run from an interactive shell and then strace(1) attaches to it, pressing ^Z would send SIGTSTP and strace(1) would notice it but the shell has no way to tell whether the child entered job control stop and thus can't tell when to take over the terminal - leading to awkward lone ^Z on the terminal. Because the job control and ptrace stopped states are independent, there is no reason to prevent real parents from accessing the stopped state regardless of ptrace. The continued state isn't separate but ptracers don't have any use for them as ptracees can never resume without explicit command from their ptracers, so as long as ptracers don't consume it, it should be fine. Although this is a behavior change, because the previous behavior is utterly broken when viewed from real parents and the change is only visible to real parents, I don't think it's necessary to make this behavior optional. One situation to be careful about is when a task from the real parent's group is ptracing. The parent group is the recipient of both ptrace and job control stop events and one stop can be reported as both job control and ptrace stops. As this can break the current ptrace users, suppress job control stopped events for these cases. If a real parent ptracer wants to know about both job control and ptrace stops, it can create a separate process to serve the role of real parent. Note that this only updates wait(2) side of things. The real parent can access the states via wait(2) but still is not properly notified (woken up and delivered signal). Test case polls wait(2) with WNOHANG to work around. Notification will be updated by future patches. Test case follows. 
#include <stdio.h> #include <unistd.h> #include <time.h> #include <errno.h> #include <sys/types.h> #include <sys/ptrace.h> #include <sys/wait.h> int main(void) { const struct timespec ts100ms = { .tv_nsec = 100000000 }; pid_t tracee, tracer; siginfo_t si; int i; tracee = fork(); if (tracee == 0) { while (1) { printf("tracee: SIGSTOP\n"); raise(SIGSTOP); nanosleep(&ts100ms, NULL); printf("tracee: SIGCONT\n"); raise(SIGCONT); nanosleep(&ts100ms, NULL); } } waitid(P_PID, tracee, &si, WSTOPPED | WNOHANG | WNOWAIT); tracer = fork(); if (tracer == 0) { nanosleep(&ts100ms, NULL); ptrace(PTRACE_ATTACH, tracee, NULL, NULL); for (i = 0; i < 11; i++) { si.si_pid = 0; waitid(P_PID, tracee, &si, WSTOPPED); if (si.si_pid && si.si_code == CLD_TRAPPED) ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status); } printf("tracer: EXITING\n"); return 0; } while (1) { si.si_pid = 0; waitid(P_PID, tracee, &si, WSTOPPED | WCONTINUED | WEXITED | WNOHANG); if (si.si_pid) printf("mommy : WAIT status=%02d code=%02d\n", si.si_status, si.si_code); nanosleep(&ts100ms, NULL); } return 0; } Before the patch, while ptraced, the parent can't see any job control events. tracee: SIGSTOP mommy : WAIT status=19 code=05 tracee: SIGCONT tracee: SIGSTOP tracee: SIGCONT tracee: SIGSTOP tracee: SIGCONT tracee: SIGSTOP tracer: EXITING mommy : WAIT status=19 code=05 ^C After the patch, tracee: SIGSTOP mommy : WAIT status=19 code=05 tracee: SIGCONT mommy : WAIT status=18 code=06 tracee: SIGSTOP mommy : WAIT status=19 code=05 tracee: SIGCONT mommy : WAIT status=18 code=06 tracee: SIGSTOP mommy : WAIT status=19 code=05 tracee: SIGCONT mommy : WAIT status=18 code=06 tracee: SIGSTOP tracer: EXITING mommy : WAIT status=19 code=05 ^C -v2: Oleg pointed out that wait(2) should be suppressed for the real parent's group instead of only the real parent task itself. Updated accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>	2011-03-23 10:37:01 +01:00
Tejun Heo	9b84cca256	job control: Fix ptracer wait(2) hang and explain notask_error clearing wait(2) and friends allow access to stopped/continued states through zombies, which is required as the states are process-wide and should be accessible whether the leader task is alive or undead. wait_consider_task() implements this by always clearing notask_error and going through wait_task_stopped/continued() for unreaped zombies. However, while ptraced, the stopped state is per-task and as such if the ptracee became a zombie, there's no further stopped event to listen to and wait(2) and friends should return -ECHILD on the tracee. Fix it by clearing notask_error only if WCONTINUED | WEXITED is set for ptraced zombies. While at it, document why clearing notask_error is safe for each case. Test case follows. #include <stdio.h> #include <unistd.h> #include <pthread.h> #include <time.h> #include <sys/types.h> #include <sys/ptrace.h> #include <sys/wait.h> static void *nooper(void *arg) { pause(); return NULL; } int main(void) { const struct timespec ts1s = { .tv_sec = 1 }; pid_t tracee, tracer; siginfo_t si; tracee = fork(); if (tracee == 0) { pthread_t thr; pthread_create(&thr, NULL, nooper, NULL); nanosleep(&ts1s, NULL); printf("tracee exiting\n"); pthread_exit(NULL); /* let subthread run */ } tracer = fork(); if (tracer == 0) { ptrace(PTRACE_ATTACH, tracee, NULL, NULL); while (1) { if (waitid(P_PID, tracee, &si, WSTOPPED) < 0) { perror("waitid"); break; } ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status); } return 0; } waitid(P_PID, tracer, &si, WEXITED); kill(tracee, SIGKILL); return 0; } Before the patch, after the tracee becomes a zombie, the tracer's waitid(WSTOPPED) never returns and the program doesn't terminate. tracee exiting ^C After the patch, tracee exiting triggers waitid() to fail. tracee exiting waitid: No child processes -v2: Oleg pointed out that exited in addition to continued can happen for ptraced dead group leader. 
Clear notask_error for ptraced child on WEXITED too. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>	2011-03-23 10:37:01 +01:00
Tejun Heo	823b018e5b	job control: Small reorganization of wait_consider_task() Move EXIT_DEAD test in wait_consider_task() above ptrace check. As ptraced tasks can't be EXIT_DEAD, this change doesn't cause any behavior change. This is to prepare for further changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>	2011-03-23 10:37:01 +01:00
Tejun Heo	408a37de6c	job control: Don't set group_stop exit_code if re-entering job control stop While ptraced, a task may be resumed while the containing process is still job control stopped. If the task receives another stop signal in this state, it will still initiate group stop, which generates group_exit_code, which the real parent would be able to see once the ptracer detaches. In this scenario, the real parent may see two consecutive CLD_STOPPED events from two stop signals without intervening SIGCONT, which normally is impossible. Test case follows. #include <stdio.h> #include <unistd.h> #include <sys/ptrace.h> #include <sys/wait.h> int main(void) { pid_t tracee; siginfo_t si; tracee = fork(); if (!tracee) while (1) pause(); kill(tracee, SIGSTOP); waitid(P_PID, tracee, &si, WSTOPPED); if (!fork()) { ptrace(PTRACE_ATTACH, tracee, NULL, NULL); waitid(P_PID, tracee, &si, WSTOPPED); ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status); waitid(P_PID, tracee, &si, WSTOPPED); ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status); waitid(P_PID, tracee, &si, WSTOPPED); ptrace(PTRACE_DETACH, tracee, NULL, NULL); return 0; } while (1) { si.si_pid = 0; waitid(P_PID, tracee, &si, WSTOPPED | WNOHANG); if (si.si_pid) printf("st=%02d c=%02d\n", si.si_status, si.si_code); } return 0; } Before the patch, the latter waitid() in polling mode reports the second stopped event generated by the implied SIGSTOP of PTRACE_ATTACH. st=19 c=05 ^C After the patch, the second event is not reported. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>	2011-03-23 10:37:01 +01:00
Tejun Heo	0e9f0a4abf	ptrace: Always put ptracee into appropriate execution state Currently, __ptrace_unlink() wakes up the tracee iff it's in TASK_TRACED. For unlinking from PTRACE_DETACH, this is correct as the tracee is guaranteed to be in TASK_TRACED or dead; however, unlinking also happens when the ptracer exits and in this case the ptracee can be in any state and the ptracee might be left running even if the group it belongs to is stopped. This patch updates __ptrace_unlink() such that GROUP_STOP_PENDING is reinstated regardless of the ptracee's current state as long as it's alive and makes sure that signal_wake_up() is called if execution state transition is necessary. Test case follows. #include <unistd.h> #include <time.h> #include <sys/types.h> #include <sys/ptrace.h> #include <sys/wait.h> static const struct timespec ts1s = { .tv_sec = 1 }; int main(void) { pid_t tracee; siginfo_t si; tracee = fork(); if (tracee == 0) { while (1) { nanosleep(&ts1s, NULL); write(1, ".", 1); } } ptrace(PTRACE_ATTACH, tracee, NULL, NULL); waitid(P_PID, tracee, &si, WSTOPPED); ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status); waitid(P_PID, tracee, &si, WSTOPPED); ptrace(PTRACE_CONT, tracee, NULL, (void *)(long)si.si_status); write(1, "exiting", 7); return 0; } Before the patch, after the parent process exits, the child is left running and prints out "." every second. exiting..... (continues) After the patch, the group stop initiated by the implied SIGSTOP from PTRACE_ATTACH is re-established when the parent exits. exiting Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com>	2011-03-23 10:37:01 +01:00
Tejun Heo	e3bd058f62	ptrace: Collapse ptrace_untrace() into __ptrace_unlink() Remove the extra task_is_traced() check in __ptrace_unlink() and collapse ptrace_untrace() into __ptrace_unlink(). This is to prepare for further changes. While at it, drop the comment on top of ptrace_untrace() and convert __ptrace_unlink() comment to docbook format. Detailed comment will be added by the next patch. This patch doesn't cause any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>	2011-03-23 10:37:01 +01:00
Tejun Heo	d79fdd6d96	ptrace: Clean transitions between TASK_STOPPED and TRACED Currently, if the task is STOPPED on ptrace attach, it's left alone and the state is silently changed to TRACED on the next ptrace call. The behavior breaks the assumption that arch_ptrace_stop() is called before any task is poked by ptrace and is ugly in that a task manipulates the state of another task directly. With GROUP_STOP_PENDING, the transitions between TASK_STOPPED and TRACED can be made clean. The tracer can use the flag to tell the tracee to retry stop on attach and detach. On retry, the tracee will enter the desired state in the correct way. The lower 16bits of task->group_stop is used to remember the signal number which caused the last group stop. This is used while retrying for ptrace attach as the original group_exit_code could have been consumed with wait(2) by then. As the real parent may wait(2) and consume the group_exit_code anytime, the group_exit_code needs to be saved separately so that it can be used when switching from regular sleep to ptrace_stop(). This is recorded in the lower 16bits of task->group_stop. If a task is already stopped and there's no intervening SIGCONT, a ptrace request immediately following a successful PTRACE_ATTACH should always succeed even if the tracer doesn't wait(2) for attach completion; however, with this change, the tracee might still be TASK_RUNNING trying to enter TASK_TRACED which would cause the following request to fail with -ESRCH. This intermediate state is hidden from the ptracer by setting GROUP_STOP_TRAPPING on attach and making ptrace_check_attach() wait for it to clear on its signal->wait_chldexit. Completing the transition or getting killed clears TRAPPING and wakes up the tracer. Note that the STOPPED -> RUNNING -> TRACED transition is still visible to other threads which are in the same group as the ptracer and the reverse transition is visible to all. Please read the comments for details. Oleg: * Spotted a race condition where a task may retry group stop without proper bookkeeping. Fixed by redoing bookkeeping on retry. * Spotted that the transition is visible to userland in several different ways. Most are fixed with GROUP_STOP_TRAPPING. Unhandled corner case is documented. * Pointed out not setting GROUP_STOP_SIGMASK on an already stopped task would result in more consistent behavior. * Pointed out that calling ptrace_stop() from do_signal_stop() in TASK_STOPPED can race with group stop start logic and then confuse the TRAPPING wait in ptrace_check_attach(). ptrace_stop() is now called with TASK_RUNNING. * Suggested using signal->wait_chldexit instead of bit wait. * Spotted a race condition between TRACED transition and clearing of TRAPPING. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com>	2011-03-23 10:37:00 +01:00
Tejun Heo	5224fa3660	ptrace: Make do_signal_stop() use ptrace_stop() if the task is being ptraced A ptraced task would still stop at do_signal_stop() when it's stopping for stop signals and do_signal_stop() behaves the same whether the task is ptraced or not. However, in addition to stopping, ptrace_stop() also does ptrace specific stuff like calling architecture specific callbacks, so this behavior makes the code more fragile and difficult to understand. This patch makes do_signal_stop() test whether the task is ptraced and use ptrace_stop() if so. This renders tracehook_notify_jctl() rather pointless as the ptrace notification is now handled by ptrace_stop() regardless of the return value from the tracehook. It probably is a good idea to update it. This doesn't solve the whole problem as tasks already in stopped state would stay in the regular stop when ptrace attached. That part will be handled by the next patch. Oleg pointed out that this makes a userland-visible change. Before, SIGCONT would be able to wake up a task in group stop even if the task is ptraced if the tracer hasn't issued another ptrace command afterwards (as the next ptrace commands transitions the state into TASK_TRACED which ignores SIGCONT wakeups). With this and the next patch, SIGCONT may race with the transition into TASK_TRACED and is ignored if the tracee already entered TASK_TRACED. Another userland visible change of this and the next patch is that the ptracee's state would now be TASK_TRACED where it used to be TASK_STOPPED, which is visible via fs/proc. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com>	2011-03-23 10:37:00 +01:00
Tejun Heo	0ae8ce1c8c	ptrace: Participate in group stop from ptrace_stop() iff the task is trapping for group stop Currently, ptrace_stop() unconditionally participates in group stop bookkeeping. This is unnecessary and inaccurate. Make it only participate if the task is trapping for group stop - ie. if @why is CLD_STOPPED. As ptrace_stop() currently is not used when trapping for group stop, this is equivalent to disabling group stop participation from ptrace_stop(). A visible behavior change is increased likelihood of delayed group stop completion if the thread group contains one or more ptraced tasks. This is to prepare for further cleanup of the interaction between group stop and ptrace. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com>	2011-03-23 10:37:00 +01:00
Tejun Heo	39efa3ef3a	signal: Use GROUP_STOP_PENDING to stop once for a single group stop Currently task->signal->group_stop_count is used to decide whether to stop for group stop. However, if there is a task in the group which is taking a long time to stop, other tasks which are continued by ptrace would repeatedly stop for the same group stop until the group stop is complete. Conversely, if a ptraced task is in TASK_TRACED state, the debugger won't get notified of group stops which is inconsistent compared to the ptraced task in any other state. This patch introduces GROUP_STOP_PENDING which tracks whether a task is yet to stop for the group stop in progress. The flag is set when a group stop starts and cleared when the task stops the first time for the group stop, and consulted whenever whether the task should participate in a group stop needs to be determined. Note that now tasks in TASK_TRACED also participate in group stop. This results in the following behavior changes. * For a single group stop, a ptracer would see at most one stop reported. * A ptracee in TASK_TRACED now also participates in group stop and the tracer would get the notification. However, as a ptraced task could be in TASK_STOPPED state or any ptrace trap could consume group stop, the notification may still be missing. These will be addressed with further patches. * A ptracee may start a group stop while one is still in progress if the tracer let it continue with stop signal delivery. Group stop code handles this correctly. Oleg: * Spotted that a task might skip signal check even when its GROUP_STOP_PENDING is set. Fixed by updating recalc_sigpending_tsk() to check GROUP_STOP_PENDING instead of group_stop_count. * Pointed out that task->group_stop should be cleared whenever task->signal->group_stop_count is cleared. Fixed accordingly. * Pointed out the behavior inconsistency between TASK_TRACED and RUNNING and the last behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com>	2011-03-23 10:37:00 +01:00
Tejun Heo	e5c1902e92	signal: Fix premature completion of group stop when interfered by ptrace task->signal->group_stop_count is used to track the progress of group stop. It's initialized to the number of tasks which need to stop for group stop to finish and each stopping or trapping task decrements. However, each task doesn't keep track of whether it decremented the counter or not and if woken up before the group stop is complete and stops again, it can decrement the counter multiple times. Please consider the following example code. static void *worker(void *arg) { while (1) ; return NULL; } int main(void) { pthread_t thread; pid_t pid; int i; pid = fork(); if (!pid) { for (i = 0; i < 5; i++) pthread_create(&thread, NULL, worker, NULL); while (1) ; return 0; } ptrace(PTRACE_ATTACH, pid, NULL, NULL); while (1) { waitid(P_PID, pid, NULL, WSTOPPED); ptrace(PTRACE_SINGLESTEP, pid, NULL, (void *)(long)SIGSTOP); } return 0; } The child creates five threads and the parent continuously traps the first thread and whenever the child gets a signal, SIGSTOP is delivered. If an external process sends SIGSTOP to the child, all other threads in the process should reliably stop. However, due to the above bug, the first thread will often end up consuming group_stop_count multiple times and SIGSTOP often ends up stopping none or part of the other four threads. This patch adds a new field task->group_stop which is protected by siglock and uses GROUP_STOP_CONSUME flag to track which task is still to consume group_stop_count to fix this bug. task_clear_group_stop_pending() and task_participate_group_stop() are added to help manipulating group stop states. As ptrace_stop() now also uses task_participate_group_stop(), it will set SIGNAL_STOP_STOPPED if it completes a group stop. There still are many issues regarding the interaction between group stop and ptrace. Patches to address them will follow. - Oleg spotted duplicate GROUP_STOP_CONSUME. Dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com>	2011-03-23 10:37:00 +01:00
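The double-decrement problem described in e5c1902e92, and the per-task consume flag that addresses it, can be illustrated with a small user-space model. This is only a sketch: the `model_*` names and the flag value are hypothetical stand-ins chosen for illustration, and locking (siglock in the kernel) is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; not the kernel's definitions. */
#define MODEL_GROUP_STOP_CONSUME 0x1u

struct model_signal {
    int group_stop_count;           /* tasks that still have to stop */
};

struct model_task {
    unsigned int group_stop;        /* per-task group stop flags */
    struct model_signal *signal;    /* shared by the thread group */
};

/* Start a group stop: every task owes exactly one decrement. */
static void model_start_group_stop(struct model_signal *sig,
                                   struct model_task *tasks, int n)
{
    sig->group_stop_count = n;
    for (int i = 0; i < n; i++) {
        tasks[i].signal = sig;
        tasks[i].group_stop |= MODEL_GROUP_STOP_CONSUME;
    }
}

/* Returns true only when this call completes the group stop. */
static bool model_participate(struct model_task *t)
{
    if (!(t->group_stop & MODEL_GROUP_STOP_CONSUME))
        return false;               /* already counted, do not decrement again */

    t->group_stop &= ~MODEL_GROUP_STOP_CONSUME;
    if (t->signal->group_stop_count > 0)
        t->signal->group_stop_count--;
    return t->signal->group_stop_count == 0;
}

/* A task that stops twice for the same group stop is counted once. */
static void model_group_stop_demo(void)
{
    struct model_signal sig = { 0 };
    struct model_task tasks[2] = { { 0 } };

    model_start_group_stop(&sig, tasks, 2);
    assert(!model_participate(&tasks[0]));
    assert(!model_participate(&tasks[0]));  /* repeated stop is ignored */
    assert(sig.group_stop_count == 1);
    assert(model_participate(&tasks[1]));   /* last task completes the stop */
}
```

Without the flag check, the second call for `tasks[0]` would drive the counter to zero and report completion while `tasks[1]` was still running, which is the premature completion the commit message describes.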
Tejun Heo	fe1bc6a095	ptrace: Add @why to ptrace_stop() To prepare for cleanup of the interaction between group stop and ptrace, add @why to ptrace_stop(). Existing users are updated such that there is no behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Roland McGrath <roland@redhat.com>	2011-03-23 10:37:00 +01:00
Tejun Heo	edf2ed153b	ptrace: Kill tracehook_notify_jctl() tracehook_notify_jctl() aids in determining whether and what to report to the parent when a task is stopped or continued. The function also adds an extra requirement that siglock may be released across it, which is currently unused and quite difficult to satisfy in well-defined manner. As job control and the notifications are about to receive major overhaul, remove the tracehook and open code it. If ever necessary, let's factor it out after the overhaul. * Oleg spotted incorrect CLD_CONTINUED/STOPPED selection when ptraced. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com>	2011-03-23 10:37:00 +01:00
Tejun Heo	71db5eb99c	signal: Remove superfluous try_to_freeze() loop in do_signal_stop() do_signal_stop() is used only by get_signal_to_deliver() and after a successful signal stop, it always calls try_to_freeze(), so the try_to_freeze() loop around schedule() in do_signal_stop() is superfluous and confusing. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com>	2011-03-23 10:37:00 +01:00
Tejun Heo	9f2bf6513a	ptrace: Remove the extra wake_up_state() from ptrace_detach() This wake_up_state() has a turbulent history. This is a remnant from ancient ptrace implementation and patently wrong. Commit `95a3540d` (ptrace_detach: the wrong wakeup breaks the ERESTARTxxx logic) removed it but the change was reverted later by commit `edaba2c5` (ptrace: revert "ptrace_detach: the wrong wakeup breaks the ERESTARTxxx logic") citing compatibility breakage and general brokenness of the whole group stop / ptrace interaction. Then, recently, it got converted from wake_up_process() to wake_up_state() to make it less dangerous. Digging through the mailing list archives, the compatibility breakage doesn't seem to be critical in the sense that the behavior isn't well defined or reliable to begin with and it seems to have been agreed to remove the wakeup with proper cleanup of the whole thing. Now that the group stop and its interaction with ptrace are being cleaned up, it's high time to finally kill this silliness. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com>	2011-03-23 10:37:00 +01:00
Tejun Heo	c672af35d5	signal: Fix SIGCONT notification code After a task receives SIGCONT, its parent is notified via SIGCHLD with its siginfo describing what the notified event is. If SIGCONT is received while the child process is stopped, the code should be CLD_CONTINUED. If SIGCONT is received while the child process is in the process of being stopped, it should be CLD_STOPPED. Which code to use is determined in prepare_signal() and recorded in signal->flags using SIGNAL_CLD_CONTINUED|STOP flags. get_signal_to_deliver() should test these flags and then notify accordingly; however, it incorrectly tested SIGNAL_STOP_CONTINUED instead of SIGNAL_CLD_CONTINUED, thus incorrectly notifying CLD_CONTINUED if the signal is delivered before the task is wait(2)ed and CLD_STOPPED if the state was fetched already. Fix it by testing SIGNAL_CLD_CONTINUED. While at it, uncompress the ?: test into if/else clause for better readability. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com>	2011-03-23 10:36:59 +01:00
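The flag mix-up described in c672af35d5 can be shown with a minimal user-space sketch. The flag values and the `model_*` names below are hypothetical and chosen only for illustration; they are not the kernel's definitions, and the real logic lives in the signal delivery path.

```c
#include <assert.h>

/* Hypothetical flag bits standing in for signal->flags. */
#define MODEL_STOP_CONTINUED 0x02u  /* state flag: group was continued */
#define MODEL_CLD_STOPPED    0x10u  /* notify parent with a "stopped" code */
#define MODEL_CLD_CONTINUED  0x20u  /* notify parent with a "continued" code */

enum model_cld { MODEL_NONE, MODEL_STOPPED, MODEL_CONTINUED };

/* Buggy selection: consults the state flag instead of the notification flag. */
static enum model_cld model_code_buggy(unsigned int flags)
{
    if (!(flags & (MODEL_CLD_STOPPED | MODEL_CLD_CONTINUED)))
        return MODEL_NONE;
    return (flags & MODEL_STOP_CONTINUED) ? MODEL_CONTINUED : MODEL_STOPPED;
}

/* Fixed selection: consults the notification flag, written as if/else. */
static enum model_cld model_code_fixed(unsigned int flags)
{
    if (!(flags & (MODEL_CLD_STOPPED | MODEL_CLD_CONTINUED)))
        return MODEL_NONE;
    if (flags & MODEL_CLD_CONTINUED)
        return MODEL_CONTINUED;
    else
        return MODEL_STOPPED;
}

static void model_sigcont_demo(void)
{
    /* SIGCONT arrived while the child was still in the process of stopping. */
    unsigned int flags = MODEL_STOP_CONTINUED | MODEL_CLD_STOPPED;

    assert(model_code_buggy(flags) == MODEL_CONTINUED);  /* wrong code */
    assert(model_code_fixed(flags) == MODEL_STOPPED);    /* intended code */
}
```

The two functions agree whenever the state flag and the notification flag happen to line up, which is presumably why the bug only showed in the orderings the commit message lists.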
Ingo Molnar	e1eb029081	Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/urgent	2011-03-23 10:35:17 +01:00
Mandeep Singh Baines	5af5bcb8d3	printk: allow setting DEFAULT_MESSAGE_LEVEL via Kconfig We've been burned by regressions/bugs which we later realized could have been triaged quicker if only we'd paid closer attention to dmesg. To make it easier to audit dmesg, we'd like to make DEFAULT_MESSAGE_LEVEL Kconfig-settable. That way we can set it to KERN_NOTICE and audit any messages <= KERN_WARNING. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joe Perches <joe@perches.com> Cc: Olof Johansson <olofj@chromium.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:13 -07:00
Kees Cook	9f36e2c448	printk: use %pK for /proc/kallsyms and /proc/modules In an effort to reduce kernel address leaks that might be used to help target kernel privilege escalation exploits, this patch uses %pK when displaying addresses in /proc/kallsyms, /proc/modules, and /sys/module/*/sections/. Note that this changes %x to %p, so some legitimately 0 values in /proc/kallsyms would have changed from 00000000 to "(null)". To avoid this, "(null)" is not used when using the "K" format. Anything that was already successfully parsing "(null)" in addition to full hex digits should have no problem with this change. (Thanks to Joe Perches for the suggestion.) Due to the %x to %p, "void *" casts are needed since these addresses are already "unsigned long" everywhere internally, due to their starting life as ELF section offsets. Signed-off-by: Kees Cook <kees.cook@canonical.com> Cc: Eugene Teo <eugene@redhat.com> Cc: Dan Rosenberg <drosenberg@vsecurity.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:12 -07:00
Feng Tang	fe3d8ad31c	console: prevent registered consoles from dumping old kernel message over again For a platform with many consoles like: "console=tty1 console=ttyMFD2 console=ttyS0 earlyprintk=mrst" Each time a non "selected_console" (tty1 and ttyMFD2 here) gets registered, the existing kernel messages are printed out on the already registered consoles again, so the "mrst" early console gets some of the same messages 3 times, and "tty1" gets some of them twice. As suggested by Andrew Morton, every time a new console is registered, it will be set as the "exclusive" console which will dump the already existing kernel messages. Signed-off-by: Feng Tang <feng.tang@intel.com> Cc: Greg KH <gregkh@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:12 -07:00
Fabio M. Di Nitto	7bf693951a	console: allow to retain boot console via boot option keep_bootcon On some architectures, the boot process involves de-registering the boot console (early boot), initializing drivers and then re-registering the console. This mechanism introduces a window in which no printk can happen on the console and messages are buffered and then printed once the new console is available. If a kernel crashes during this window, all that's left on the boot console is "console [foo] enabled, bootconsole disabled", making debugging of the crash rather 'interesting'. With the "keep_bootcon" option, the boot console is not unregistered, which allows printk to report everything that happens up to the crash. The option is clearly meant only for debugging purposes as it introduces lots of duplicated info printed on console, but will make bug reports from users easier as it doesn't require a kernel build just to figure out where we crash. Signed-off-by: Fabio M. Di Nitto <fabbione@fabbione.net> Acked-by: David S. Miller <davem@davemloft.net> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Greg KH <gregkh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:12 -07:00
Don Zickus	f99a99330f	kernel/watchdog.c: always return NOTIFY_OK during cpu up/down events This patch addresses a couple of problems. One was the case when the hardlockup failed to start, it also failed to start the softlockup. There were valid cases when the hardlockup shouldn't start and that shouldn't block the softlockup (no lapic, bios controls perf counters). The second problem was when the hardlockup failed to start on boxes (from a no lapic or bios controlled perf counter case), it reported failure to the cpu notifier chain. This blocked the notifier from continuing to start other more critical pieces of cpu bring-up (in our case based on a 2.6.32 fork, it was the mce). As a result, during soft cpu online/offline testing, the system would panic when a cpu was offlined because the cpu notifier would succeed in processing a watchdog disable cpu event and would panic in the mce case as a result of un-initialized variables from a never executed cpu up event. I realized the hardlockup/softlockup cases are really just debugging aids and should never impede the progress of a cpu up/down event. Therefore I modified the code to always return NOTIFY_OK and instead rely on printks to inform the user of problems. Signed-off-by: Don Zickus <dzickus@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:12 -07:00
Don Zickus	fef2c9bc1b	kernel/watchdog.c: allow hardlockup to panic by default When a cpu is considered stuck, instead of limping along and just printing a warning, it is sometimes preferred to just panic, let kdump capture the vmcore and reboot. This gets the machine back into a stable state quickly while saving the info that got it into a stuck state to begin with. Add a Kconfig option to allow users to set the hardlockup to panic by default. Also add in a 'nmi_watchdog=nopanic' to override this. [akpm@linux-foundation.org: fix strncmp length] Signed-off-by: Don Zickus <dzickus@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:12 -07:00
Oleg Nesterov	9bfb23fc4a	sys_unshare: remove the dead CLONE_THREAD/SIGHAND/VM code Cleanup: kill the dead code which does nothing but complicates the code and confuses the reader. sys_unshare(CLONE_THREAD/SIGHAND/VM) is not really implemented, and I doubt very much it will ever work. At least, nobody even tried since the original `99d1419d96` ("unshare system call -v5: system call handler function") was applied more than 4 years ago. And the code is not consistent. unshare_thread() always fails unconditionally, while unshare_sighand() and unshare_vm() pretend to work if there is nothing to unshare. Remove unshare_thread(), unshare_sighand(), unshare_vm() helpers and related variables and add a simple CLONE_THREAD | CLONE_SIGHAND | CLONE_VM check into check_unshare_flags(). Also, move the "CLONE_NEWNS needs CLONE_FS" check from check_unshare_flags() to sys_unshare(). This looks more consistent and matches the similar do_sysvsem check in sys_unshare(). Note: with or without this patch "atomic_read(mm->mm_users) > 1" can give a false positive due to get_task_mm(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Janak Desai <janak@us.ibm.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:11 -07:00
Michael Rodriguez	4d51985e48	kernel/cpu.c: fix many errors related to style. Change the printk() calls to have the KERN_INFO/KERN_ERROR stuff, and fixes other coding style errors. Not _all_ of them are gone, though. [akpm@linux-foundation.org: revert the bits I disagree with] Signed-off-by: Michael Rodriguez <dkingston02@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:11 -07:00
Amerigo Wang	34db18a054	smp: move smp setup functions to kernel/smp.c Move setup_nr_cpu_ids(), smp_init() and some other SMP boot parameter setup functions from init/main.c to kernel/smp.c, which saves some #ifdef CONFIG_SMP. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Rakib Mullick <rakib.mullick@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:11 -07:00
Olaf Hering	d404ab0a11	move x86 specific oops=panic to generic code The oops=panic cmdline option is not x86 specific, move it to generic code. Update documentation. Signed-off-by: Olaf Hering <olaf@aepfle.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:11 -07:00
Eric Dumazet	94dcf29a11	kthread: use kthread_create_on_node() ksoftirqd, kworker, migration, and pktgend kthreads can be created with kthread_create_on_node(), to get proper NUMA affinities for their stack and task_struct. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: David Howells <dhowells@redhat.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:01 -07:00
Eric Dumazet	207205a2ba	kthread: NUMA aware kthread_create_on_node() All kthreads being created from a single helper task, they all use memory from a single node for their kernel stack and task struct. This patch suite creates kthread_create_on_node(), adding a 'cpu' parameter to parameters already used by kthread_create(). This parameter serves in allocating memory for the new kthread on its memory node if possible. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: David Howells <dhowells@redhat.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:01 -07:00
Eric Dumazet	b6a84016bd	mm: NUMA aware alloc_thread_info_node() Add a node parameter to alloc_thread_info(), and change its name to alloc_thread_info_node() This change is needed to allow NUMA aware kthread_create_on_cpu() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: David Howells <dhowells@redhat.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:01 -07:00
Eric Dumazet	504f52b543	mm: NUMA aware alloc_task_struct_node() All kthreads being created from a single helper task, they all use memory from a single node for their kernel stack and task struct. This patch suite creates kthread_create_on_cpu(), adding a 'cpu' parameter to parameters already used by kthread_create(). This parameter serves in allocating memory for the new kthread on its memory node if available. Users of this new function are : ksoftirqd, kworker, migration, pktgend... This patch: Add a node parameter to alloc_task_struct(), and change its name to alloc_task_struct_node() This change is needed to allow NUMA aware kthread_create_on_cpu() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: David Howells <dhowells@redhat.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:01 -07:00
Phil Carmody	8d2587970b	cgroups: if you list_empty() a head then don't list_del() it list_del() leaves poison in the prev and next pointers. The next list_empty() will compare those poisons, and say the list isn't empty. Any list operations that assume the node is on a list because of such a check will be fooled into dereferencing poison. One needs to INIT the node after the del, and fortunately there's already a wrapper for that - list_del_init(). Some of the dels are followed by deallocations, so can be ignored, and one can be merged with an add to make a move. Apart from that, I erred on the side of caution in making nodes list_empty()-queriable. Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:43:58 -07:00
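The pitfall described in 8d2587970b can be reproduced with a tiny user-space list. This is a sketch under stated assumptions: the `model_*` helpers are hypothetical re-implementations written for illustration, and the poison markers are addresses of dummy objects rather than the kernel's LIST_POISON values.

```c
#include <assert.h>
#include <stdbool.h>

struct model_list_head {
    struct model_list_head *next, *prev;
};

/* Stand-ins for poison pointers: any value that is not the node itself. */
static struct model_list_head model_poison1, model_poison2;

static void model_init(struct model_list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Emptiness is judged purely by "next points back at me". */
static bool model_empty(const struct model_list_head *h)
{
    return h->next == h;
}

static void model_add(struct model_list_head *n, struct model_list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

static void model_unlink(struct model_list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

/* Plain delete: leaves poison behind in the removed node. */
static void model_del(struct model_list_head *e)
{
    model_unlink(e);
    e->next = &model_poison1;
    e->prev = &model_poison2;
}

/* Delete and re-initialise: the node can be queried with model_empty(). */
static void model_del_init(struct model_list_head *e)
{
    model_unlink(e);
    model_init(e);
}

static void model_list_demo(void)
{
    struct model_list_head head, node;

    model_init(&head);
    model_add(&node, &head);
    model_del(&node);
    assert(!model_empty(&node));   /* poison makes the node look "on a list" */

    model_add(&node, &head);
    model_del_init(&node);
    assert(model_empty(&node));    /* safe to test after del_init */
    assert(model_empty(&head));
}
```

Code that guards a later list operation with an emptiness test on the node would, in the first case, go on to follow the poison pointers.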
Jiri Olsa	1106b6997d	tracing: Fix set_ftrace_filter probe function display If one or more function probes (like traceon) are enabled, and there's no other function filter, the first probe func is skipped (which one depends on the position in the hash). $ echo sys_open:traceon sys_close:traceon > ./set_ftrace_filter $ cat set_ftrace_filter #### all functions enabled #### sys_close:traceon:unlimited $ The reason was, that in the case of no other function filter, the func_pos was not properly updated before calling t_hash_start. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1297874134-7008-1-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-03-22 12:52:03 -04:00
Julien Tinnes	da48524eb2	Prevent rt_sigqueueinfo and rt_tgsigqueueinfo from spoofing the signal code Userland should be able to trust the pid and uid of the sender of a signal if the si_code is SI_TKILL. Unfortunately, the kernel has historically allowed sigqueueinfo() to send any si_code at all (as long as it was negative - to distinguish it from kernel-generated signals like SIGILL etc), so it could spoof a SI_TKILL with incorrect siginfo values. Happily, it looks like glibc has always set si_code to the appropriate SI_QUEUE, so there is probably no actual user code that ever uses anything but the appropriate SI_QUEUE flag. So just tighten the check for si_code (we used to allow any negative value), and add a (one-time) warning in case there are binaries out there that might depend on using other si_code values. Signed-off-by: Julien Tinnes <jln@google.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-21 14:23:43 -07:00
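The tightened si_code check described in da48524eb2 can be sketched in user space as follows. The `model_*` names are hypothetical, the constants mirror the Linux values of SI_QUEUE and SI_TKILL but are redefined locally so the sketch stays self-contained, and the warning is a plain one-shot message on stderr rather than a kernel warning.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Local copies of the Linux si_code values, for illustration only. */
#define MODEL_SI_QUEUE (-1)
#define MODEL_SI_TKILL (-6)

static bool model_warned;   /* set after the first warning is printed */

/* Old rule: any negative si_code from userland was accepted. */
static int model_old_check(int si_code)
{
    return si_code >= 0 ? -EPERM : 0;
}

/* New rule: only SI_QUEUE is accepted; warn once about other negatives. */
static int model_new_check(int si_code)
{
    if (si_code != MODEL_SI_QUEUE) {
        if (si_code < 0 && !model_warned) {
            model_warned = true;
            fprintf(stderr, "si_code %d was allowed before, now rejected\n",
                    si_code);
        }
        return -EPERM;
    }
    return 0;
}

static void model_sigqueue_demo(void)
{
    /* A spoofed SI_TKILL passed the old check but fails the new one. */
    assert(model_old_check(MODEL_SI_TKILL) == 0);
    assert(model_new_check(MODEL_SI_TKILL) == -EPERM);
    /* The value glibc is said to use keeps working. */
    assert(model_new_check(MODEL_SI_QUEUE) == 0);
}
```

Rejecting everything except one value is a whitelist, so a newly introduced negative code cannot slip through the way SI_TKILL did under the "any negative value" rule.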
Linus Torvalds	a44f99c7ef	Merge branch 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6 * 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6: (25 commits) video: change to new flag variable scsi: change to new flag variable rtc: change to new flag variable rapidio: change to new flag variable pps: change to new flag variable net: change to new flag variable misc: change to new flag variable message: change to new flag variable memstick: change to new flag variable isdn: change to new flag variable ieee802154: change to new flag variable ide: change to new flag variable hwmon: change to new flag variable dma: change to new flag variable char: change to new flag variable fs: change to new flag variable xtensa: change to new flag variable um: change to new flag variables s390: change to new flag variable mips: change to new flag variable ... Fix up trivial conflict in drivers/hwmon/Makefile	2011-03-20 18:14:55 -07:00
Randy Dunlap	16addf954d	sched: Fix yield_to kernel-doc Add missing function parameters for yield_to(): Warning(kernel/sched.c:5470): No description found for parameter 'p' Warning(kernel/sched.c:5470): No description found for parameter 'preempt' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110318093453.8f7489a4.randy.dunlap@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-19 19:19:36 +01:00
Linus Torvalds	508996b6a0	Merge branches 'irq-fixes-for-linus' and 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Fix incorrect unlock in __setup_irq() cris: Use generic show_interrupts() genirq: show_interrupts: Check desc->name before printing it blindly cris: Use accessor functions to set IRQ_PER_CPU flag cris: Fix irq conversion fallout * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched, kernel-doc: Fix runqueue_is_locked() description	2011-03-18 10:44:05 -07:00
Linus Torvalds	619297855a	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits) trace, filters: Initialize the match variable in process_ops() properly trace, documentation: Fix branch profiling location in debugfs oprofile, s390: Cleanups oprofile, s390: Remove hwsampler_files.c and merge it into init.c perf: Fix tear-down of inherited group events perf: Reorder & optimize perf_event_context to remove alignment padding on 64 bit builds perf: Handle stopped state with tracepoints perf: Fix the software events state check perf, powerpc: Handle events that raise an exception without overflowing perf, x86: Use INTEL_*_CONSTRAINT() for all PEBS event constraints perf, x86: Clean up SandyBridge PEBS events perf lock: Fix sorting by wait_min perf tools: Version incorrect with some versions of grep perf evlist: New command to list the names of events present in a perf.data file perf script: Add support for H/W and S/W events perf script: Add support for dumping symbols perf script: Support custom field selection for output perf script: Move printing of 'common' data from print_event and rename perf tracing: Remove print_graph_cpu and print_graph_proc from trace-event-parse perf script: Change process_event prototype ...	2011-03-18 10:38:34 -07:00
Linus Torvalds	e16b396ce3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits) doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore Update cpuset info & webiste for cgroups dcdbas: force SMI to happen when expected arch/arm/Kconfig: remove one to many l's in the word. asm-generic/user.h: Fix spelling in comment drm: fix printk typo 'sracth' Remove one to many n's in a word Documentation/filesystems/romfs.txt: fixing link to genromfs drivers:scsi Change printk typo initate -> initiate serial, pch uart: Remove duplicate inclusion of linux/pci.h header fs/eventpoll.c: fix spelling mm: Fix out-of-date comments which refers non-existent functions drm: Fix printk typo 'failled' coh901318.c: Change initate to initiate. mbox-db5500.c Change initate to initiate. edac: correct i82975x error-info reported edac: correct i82975x mci initialisation edac: correct commented info fs: update comments to point correct document target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c ... Trivial conflict in fs/eventpoll.c (spelling vs addition)	2011-03-18 10:37:40 -07:00
David Howells	d57f078b19	KGDB: Notify GDB of machine halt, reboot or power off Notify GDB of the machine halting, rebooting or powering off by sending it an exited command (remote protocol command 'W'). This is done by calling: void gdbstub_exit(int status) from the arch's machine_{halt,restart,power_off}() functions with an appropriate exit status to be reported to GDB. Signed-off-by: David Howells <dhowells@redhat.com>	2011-03-18 16:54:31 +00:00
Ingo Molnar	1ef1d1c235	trace, filters: Initialize the match variable in process_ops() properly Make sure the 'match' variable always has a value. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-18 14:41:27 +01:00
Linus Torvalds	ec0afc9311	Merge branch 'kvm-updates/2.6.39' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/2.6.39' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (55 commits) KVM: unbreak userspace that does not sets tss address KVM: MMU: cleanup pte write path KVM: MMU: introduce a common function to get no-dirty-logged slot KVM: fix rcu usage in init_rmode_* functions KVM: fix kvmclock regression due to missing clock update KVM: emulator: Fix permission checking in io permission bitmap KVM: emulator: Fix io permission checking for 64bit guest KVM: SVM: Load %gs earlier if CONFIG_X86_32_LAZY_GS=n KVM: x86: Remove useless regs_page pointer from kvm_lapic KVM: improve comment on rcu use in irqfd_deassign KVM: MMU: remove unused macros KVM: MMU: cleanup page alloc and free KVM: MMU: do not record gfn in kvm_mmu_pte_write KVM: MMU: move mmu pages calculated out of mmu lock KVM: MMU: set spte accessed bit properly KVM: MMU: fix kvm_mmu_slot_remove_write_access dropping intermediate W bits KVM: Start lock documentation KVM: better readability of efer_reserved_bits KVM: Clear async page fault hash after switching to real mode KVM: VMX: Initialize vm86 TSS only once. ...	2011-03-17 18:40:35 -07:00
Milton Miller	c8def554d0	smp_call_function_interrupt: use typedef and %pf Use the newly added smp_call_func_t in smp_call_function_interrupt for the func variable, and make the comment above the WARN more assertive and explicit. Also, func is a function pointer and does not need an offset, so use %pf not %pS. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-17 16:58:11 -07:00
Milton Miller	723aae25d5	smp_call_function_many: handle concurrent clearing of mask Mike Galbraith reported finding a lockup ("perma-spin bug") where the cpumask passed to smp_call_function_many was cleared by other cpu(s) while a cpu was preparing its call_data block, resulting in no cpu to clear the last ref and unlock the block. Having cpus clear their bit asynchronously could be useful on a mask of cpus that might have a translation context, or cpus that need a push to complete an rcu window. Instead of adding a BUG_ON and requiring yet another cpumask copy, just detect the race and handle it. Note: arch_send_call_function_ipi_mask must still handle an empty cpumask because the data block is globally visible before that arch callback is made. And (obviously) there are no guarantees to which cpus are notified if the mask is changed during the call; only cpus that were online and had their mask bit set during the whole call are guaranteed to be called. Reported-by: Mike Galbraith <efault@gmx.de> Reported-by: Jan Beulich <JBeulich@novell.com> Acked-by: Jan Beulich <jbeulich@novell.com> Cc: stable@kernel.org Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-17 16:58:10 -07:00
Milton Miller	45a5791920	call_function_many: add missing ordering Paul McKenney's review pointed out two problems with the barriers in the 2.6.38 update to the smp call function many code. First, a barrier that would force the func and info members of data to be visible before their consumption in the interrupt handler was missing. This can be solved by adding a smp_wmb between setting the func and info members and setting the cpumask; this will pair with the existing and required smp_rmb ordering the cpumask read before the read of refs. This placement avoids the need for a second smp_rmb in the interrupt handler which would be executed on each of the N cpus executing the call request. (I was thinking this barrier was present but was not). Second, the previous write to refs (establishing the zero that the interrupt handler was testing from all cpus) was performed by a third party cpu. This would invoke transitivity which, as a recent or concurrent addition to memory-barriers.txt now explicitly states, would require a full smp_mb(). However, we know the cpumask will only be set by one cpu (the data owner) and any previous iteration of the mask would have been cleared by the reading cpu. By redundantly writing refs to 0 on the owning cpu before the smp_wmb, the write to refs will follow the same path as the writes that set the cpumask, which in turn allows us to keep the barrier in the interrupt handler a smp_rmb instead of promoting it to a smp_mb (which will be executed by N cpus for each of the possible M elements on the list). I moved and expanded the comment about our (ab)use of the rcu list primitives for the concurrent walk earlier into this function. I considered moving the first two paragraphs to the queue list head and lock, but felt it would have been too disconnected from the code. Cc: Paul McKinney <paulmck@linux.vnet.ibm.com> Cc: stable@kernel.org (2.6.32 and later) Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-17 16:58:10 -07:00
Milton Miller	e6cd1e07a1	call_function_many: fix list delete vs add race Peter pointed out there was nothing preventing the list_del_rcu in smp_call_function_interrupt from running before the list_add_rcu in smp_call_function_many. Fix this by not setting refs until we have gotten the lock for the list. Take advantage of the wmb in list_add_rcu to save an explicit additional one. I tried to force this race with a udelay before the lock & list_add and by mixing all 64 online cpus with just 3 random cpus in the mask, but was unsuccessful. Still, inspection shows a valid race, and the fix is an extension of the existing protection window in the current code. Cc: stable@kernel.org (v2.6.32 and later) Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-17 16:58:10 -07:00
Rik van Riel	77c100c83e	export pid symbols needed for kvm_vcpu_on_spin Export the symbols required for a race-free kvm_vcpu_on_spin. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-03-17 13:08:28 -03:00
Dan Carpenter	1c389795c1	genirq: Fix incorrect unlock in __setup_irq() goto out_thread is called before we take the lock. It causes a gcc warning: "kernel/irq/manage.c:858: warning: ‘flags’ may be used uninitialized in this function" [ tglx: Moved unlock before free_cpumask_var() ] Signed-off-by: Dan Carpenter <error27@gmail.com> LKML-Reference: <20110317114307.GJ2008@bicker> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-17 15:52:30 +01:00
Thomas Gleixner	ee0401ec11	genirq: show_interrupts: Check desc->name before printing it blindly desc->name is not required and not used by all architectures. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-17 15:52:19 +01:00
matt mooney	ed3cd4a865	kernel: change to new flag variable Replace EXTRA_CFLAGS with ccflags-y. Signed-off-by: matt mooney <mfm@muteddisk.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Michal Marek <mmarek@suse.cz>	2011-03-17 14:02:54 +01:00
David Rientjes	13e5befadd	trace, documentation: Fix branch profiling location in debugfs The debugfs interface for branch profiling is through /sys/kernel/debug/tracing/trace_stat/branch_annotated /sys/kernel/debug/tracing/trace_stat/branch_all so update the Kconfig accordingly. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <alpine.DEB.2.00.1103161716320.11407@chino.kir.corp.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-17 09:10:45 +01:00
Paul Mundt	1d2a1959fe	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into sh-latest	2011-03-17 16:44:08 +09:00
Linus Torvalds	f74b944419	Merge branch 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl * 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: BKL: That's all, folks fs/locks.c: Remove stale FIXME left over from BKL conversion ipx: remove the BKL appletalk: remove the BKL x25: remove the BKL ufs: remove the BKL hpfs: remove the BKL drivers: remove extraneous includes of smp_lock.h tracing: don't trace the BKL adfs: remove the big kernel lock	2011-03-16 17:21:00 -07:00
Linus Torvalds	7a6362800c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1480 commits) bonding: enable netpoll without checking link status xfrm: Refcount destination entry on xfrm_lookup net: introduce rx_handler results and logic around that bonding: get rid of IFF_SLAVE_INACTIVE netdev->priv_flag bonding: wrap slave state work net: get rid of multiple bond-related netdevice->priv_flags bonding: register slave pointer for rx_handler be2net: Bump up the version number be2net: Copyright notice change. Update to Emulex instead of ServerEngines e1000e: fix kconfig for crc32 dependency netfilter ebtables: fix xt_AUDIT to work with ebtables xen network backend driver bonding: Improve syslog message at device creation time bonding: Call netif_carrier_off after register_netdevice bonding: Incorrect TX queue offset net_sched: fix ip_tos2prio xfrm: fix __xfrm_route_forward() be2net: Fix UDP packet detected status in RX compl Phonet: fix aligned-mode pipe socket buffer header reserve netxen: support for GbE port settings ... Fix up conflicts in drivers/staging/brcm80211/brcmsmac/wl_mac80211.c with the staging updates.	2011-03-16 16:29:25 -07:00
Linus Torvalds	a5e6b135bd	Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (50 commits) printk: do not mangle valid userspace syslog prefixes efivars: Add Documentation efivars: Expose efivars functionality to external drivers. efivars: Parameterize operations. efivars: Split out variable registration efivars: parameterize efivars efivars: Make efivars bin_attributes dynamic efivars: move efivars globals into struct efivars drivers:misc: ti-st: fix debugging code kref: Fix typo in kref documentation UIO: add PRUSS UIO driver support Fix spelling mistakes in Documentation/zh_CN/SubmittingPatches firmware: Fix unaligned memory accesses in dmi-sysfs firmware: Add documentation for /sys/firmware/dmi firmware: Expose DMI type 15 System Event Log firmware: Break out system_event_log in dmi-sysfs firmware: Basic dmi-sysfs support firmware: Add DMI entry types to the headers Driver core: convert platform_{get,set}_drvdata to static inline functions Translate linux-2.6/Documentation/magic-number.txt into Chinese ...	2011-03-16 15:05:40 -07:00
Randy Dunlap	1fd06bb157	sched.c: fix kernel-doc for runqueue_is_locked() Fix kernel-doc warning for runqueue_is_locked(): Warning(kernel/sched.c:664): missing initial short description on line: Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-16 10:47:04 -07:00
Linus Torvalds	fc82e1d59a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: (21 commits) PM / Hibernate: Reduce autotuned default image size PM / Core: Introduce struct syscore_ops for core subsystems PM PM QoS: Make pm_qos settings readable PM / OPP: opp_find_freq_exact() documentation fix PM: Documentation/power/states.txt: fix repetition PM: Make system-wide PM and runtime PM treat subsystems consistently PM: Simplify kernel/power/Kconfig PM: Add support for device power domains PM: Drop pm_flags that is not necessary PM: Allow pm_runtime_suspend() to succeed during system suspend PM: Clean up PM_TRACE dependencies and drop unnecessary Kconfig option PM: Remove CONFIG_PM_OPS PM: Reorder power management Kconfig options PM: Make CONFIG_PM depend on (CONFIG_PM_SLEEP \|\| CONFIG_PM_RUNTIME) PM / ACPI: Remove references to pm_flags from bus.c PM: Do not create wakeup sysfs files for devices that cannot wake up USB / Hub: Do not call device_set_wakeup_capable() under spinlock PM: Use appropriate printk() priority level in trace.c PM / Wakeup: Don't update events_check_enabled in pm_get_wakeup_count() PM / Wakeup: Make pm_save_wakeup_count() work as documented ...	2011-03-16 09:24:44 -07:00
Linus Torvalds	0f6e0e8448	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits) AppArmor: kill unused macros in lsm.c AppArmor: cleanup generated files correctly KEYS: Add an iovec version of KEYCTL_INSTANTIATE KEYS: Add a new keyctl op to reject a key with a specified error code KEYS: Add a key type op to permit the key description to be vetted KEYS: Add an RCU payload dereference macro AppArmor: Cleanup make file to remove cruft and make it easier to read SELinux: implement the new sb_remount LSM hook LSM: Pass -o remount options to the LSM SELinux: Compute SID for the newly created socket SELinux: Socket retains creator role and MLS attribute SELinux: Auto-generate security_is_socket_class TOMOYO: Fix memory leak upon file open. Revert "selinux: simplify ioctl checking" selinux: drop unused packet flow permissions selinux: Fix packet forwarding checks on postrouting selinux: Fix wrong checks for selinux_policycap_netpeer selinux: Fix check for xfrm selinux context algorithm ima: remove unnecessary call to ima_must_measure IMA: remove IMA imbalance checking ...	2011-03-16 09:15:43 -07:00
Linus Torvalds	bd2895eead	Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix build failure introduced by s/freezeable/freezable/ workqueue: add system_freezeable_wq rds/ib: use system_wq instead of rds_ib_fmr_wq net/9p: replace p9_poll_task with a work net/9p: use system_wq instead of p9_mux_wq xfs: convert to alloc_workqueue() reiserfs: make commit_wq use the default concurrency level ocfs2: use system_wq instead of ocfs2_quota_wq ext4: convert to alloc_workqueue() scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path scsi/be2iscsi,qla2xxx: convert to alloc_workqueue() misc/iwmc3200top: use system_wq instead of dedicated workqueues i2o: use alloc_workqueue() instead of create_workqueue() acpi: kacpi*_wq don't need WQ_MEM_RECLAIM fs/aio: aio_wq isn't used in memory reclaim path input/tps6507x-ts: use system_wq instead of dedicated workqueue cpufreq: use system_wq instead of dedicated workqueues wireless/ipw2x00: use system_wq instead of dedicated workqueues arm/omap: use system_wq in mailbox workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER	2011-03-16 08:20:19 -07:00
Linus Torvalds	016aa2ed1c	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: smp: Document transitivity for memory barriers. rcu: add comment saying why DEBUG_OBJECTS_RCU_HEAD depends on PREEMPT. rcupdate: remove dead code rcu: add documentation saying which RCU flavor to choose rcutorture: Get rid of duplicate sched.h include rcu: call __rcu_read_unlock() in exit_rcu for tiny RCU	2011-03-16 08:10:07 -07:00
Peter Zijlstra	38b435b16c	perf: Fix tear-down of inherited group events When destroying inherited events, we need to destroy groups too, otherwise the event iteration in perf_event_exit_task_context() will miss group siblings and we leak events with all the consequences. Reported-and-tested-by: Vince Weaver <vweaver1@eecs.utk.edu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> # .35+ LKML-Reference: <1300196470.2203.61.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-16 14:04:14 +01:00
Frederic Weisbecker	a0f7d0f7fc	perf: Handle stopped state with tracepoints We toggle the state from start and stop callbacks but actually don't check it when the event triggers. Do it so that these callbacks actually work. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: <stable@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1299529629-18280-2-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-16 14:04:14 +01:00
Frederic Weisbecker	91b2f482e6	perf: Fix the software events state check Fix the mistakenly inverted check of events state. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: <stable@kernel.org> LKML-Reference: <1299529629-18280-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-16 14:04:13 +01:00
Randy Dunlap	58cbe2476a	sched, kernel-doc: Fix runqueue_is_locked() description Fix kernel-doc warning for runqueue_is_locked(): Warning(kernel/sched.c:664): missing initial short description Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110315161230.c4e1e8e3.rdunlap@xenotime.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-16 14:00:23 +01:00
Linus Torvalds	5f6fb45466	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (116 commits) x86: Enable forced interrupt threading support x86: Mark low level interrupts IRQF_NO_THREAD x86: Use generic show_interrupts x86: ioapic: Avoid redundant lookup of irq_cfg x86: ioapic: Use new move_irq functions x86: Use the proper accessors in fixup_irqs() x86: ioapic: Use irq_data->state x86: ioapic: Simplify irq chip and handler setup x86: Cleanup the genirq name space genirq: Add chip flag to force mask on suspend genirq: Add desc->irq_data accessor genirq: Add comments to Kconfig switches genirq: Fixup fasteoi handler for oneshot mode genirq: Provide forced interrupt threading sched: Switch wait_task_inactive to schedule_hrtimeout() genirq: Add IRQF_NO_THREAD genirq: Allow shared oneshot interrupts genirq: Prepare the handling of shared oneshot interrupts genirq: Make warning in handle_percpu_event useful x86: ioapic: Move trigger defines to io_apic.h ... Fix up trivial(?) conflicts in arch/x86/pci/xen.c due to genirq name space changes clashing with the Xen cleanups. The set_irq_msi() had moved to xen_bind_pirq_msi_to_irq().	2011-03-15 19:23:40 -07:00
Linus Torvalds	420c1c572d	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (62 commits) posix-clocks: Check write permissions in posix syscalls hrtimer: Remove empty hrtimer_init_hres_timer() hrtimer: Update hrtimer->state documentation hrtimer: Update base[CLOCK_BOOTTIME].offset correctly timers: Export CLOCK_BOOTTIME via the posix timers interface timers: Add CLOCK_BOOTTIME hrtimer base time: Extend get_xtime_and_monotonic_offset() to also return sleep time: Introduce get_monotonic_boottime and ktime_get_boottime hrtimers: extend hrtimer base code to handle more then 2 clockids ntp: Remove redundant and incorrect parameter check mn10300: Switch do_timer() to xtimer_update() posix clocks: Introduce dynamic clocks posix-timers: Cleanup namespace posix-timers: Add support for fd based clocks x86: Add clock_adjtime for x86 posix-timers: Introduce a syscall for clock tuning. time: Splitout compat timex accessors ntp: Add ADJ_SETOFFSET mode bit time: Introduce timekeeping_inject_offset posix-timer: Update comment ... Fix up new system-call-related conflicts in arch/x86/ia32/ia32entry.S arch/x86/include/asm/unistd_32.h arch/x86/include/asm/unistd_64.h arch/x86/kernel/syscall_table_32.S (name_to_handle_at()/open_by_handle_at() vs clock_adjtime()), and some due to movement of get_jiffies_64() in: kernel/time.c	2011-03-15 18:53:35 -07:00
Linus Torvalds	9620639b7e	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits) sched: Resched proper CPU on yield_to() sched: Allow users with sufficient RLIMIT_NICE to change from SCHED_IDLE policy sched: Allow SCHED_BATCH to preempt SCHED_IDLE tasks sched: Clean up the IRQ_TIME_ACCOUNTING code sched: Add #ifdef around irq time accounting functions sched, autogroup: Stop claiming ownership of the root task group sched, autogroup: Stop going ahead if autogroup is disabled sched, autogroup, sysctl: Use proc_dointvec_minmax() instead sched: Fix the group_imb logic sched: Clean up some f_b_g() comments sched: Clean up remnants of sd_idle sched: Wholesale removal of sd_idle logic sched: Add yield_to(task, preempt) functionality sched: Use a buddy to implement yield_task_fair() sched: Limit the scope of clear_buddies sched: Check the right ->nr_running in yield_task_fair() sched: Avoid expensive initial update_cfs_load(), on UP too sched: Fix switch_from_fair() sched: Simplify the idle scheduling class softirqs: Account ksoftirqd time as cpustat softirq ...	2011-03-15 18:37:30 -07:00
Linus Torvalds	a926021cb1	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (184 commits) perf probe: Clean up probe_point_lazy_walker() return value tracing: Fix irqoff selftest expanding max buffer tracing: Align 4 byte ints together in struct tracer tracing: Export trace_set_clr_event() tracing: Explain about unstable clock on resume with ring buffer warning ftrace/graph: Trace function entry before updating index ftrace: Add .ref.text as one of the safe areas to trace tracing: Adjust conditional expression latency formatting. tracing: Fix event alignment: skb:kfree_skb tracing: Fix event alignment: mce:mce_record tracing: Fix event alignment: kvm:kvm_hv_hypercall tracing: Fix event alignment: module:module_request tracing: Fix event alignment: ftrace:context_switch and ftrace:wakeup tracing: Remove lock_depth from event entry perf header: Stop using 'self' perf session: Use evlist/evsel for managing perf.data attributes perf top: Don't let events to eat up whole header line perf top: Fix events overflow in top command ring-buffer: Remove unused #include <linux/trace_irq.h> tracing: Add an 'overwrite' trace_option. ...	2011-03-15 18:31:30 -07:00
Linus Torvalds	0586bed3e8	Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rtmutex: tester: Remove the remaining BKL leftovers lockdep/timers: Explain in detail the locking problems del_timer_sync() may cause rtmutex: Simplify PI algorithm and make highest prio task get lock rwsem: Remove redundant asmregparm annotation rwsem: Move duplicate function prototypes to linux/rwsem.h rwsem: Unify the duplicate rwsem_is_locked() inlines rwsem: Move duplicate init macros and functions to linux/rwsem.h rwsem: Move duplicate struct rwsem declaration to linux/rwsem.h x86: Cleanup rwsem_count_t typedef rwsem: Cleanup includes locking: Remove deprecated lock initializers cred: Replace deprecated spinlock initialization kthread: Replace deprecated spinlock initialization xtensa: Replace deprecated spinlock initialization um: Replace deprecated spinlock initialization sparc: Replace deprecated spinlock initialization mips: Replace deprecated spinlock initialization cris: Replace deprecated spinlock initialization alpha: Replace deprecated spinlock initialization rtmutex-tester: Remove BKL tests	2011-03-15 18:28:30 -07:00
Linus Torvalds	b80cd62b7d	Merge branch 'core-futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: arm: Remove bogus comment in futex_atomic_cmpxchg_inatomic() futex: Deobfuscate handle_futex_death() plist: Add priority list test plist: Shrink struct plist_head futex,plist: Remove debug lock assignment from plist_node futex,plist: Pass the real head of the priority list to plist_del() futex: Sanitize futex ops argument types futex: Sanitize cmpxchg_futex_value_locked API futex: Remove redundant pagefault_disable in futex_atomic_cmpxchg_inatomic() futex: Avoid redudant evaluation of task_pid_vnr() futex: Update futex_wait_setup comments about locking	2011-03-15 18:23:52 -07:00
Linus Torvalds	c345f60a5f	Merge branch 'core-debugobjects-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-debugobjects-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: debugobjects: Add hint for better object identification	2011-03-15 18:23:25 -07:00
Linus Torvalds	422e6c4bc4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (57 commits) tidy the trailing symlinks traversal up Turn resolution of trailing symlinks iterative everywhere simplify link_path_walk() tail Make trailing symlink resolution in path_lookupat() iterative update nd->inode in __do_follow_link() instead of after do_follow_link() pull handling of one pathname component into a helper fs: allow AT_EMPTY_PATH in linkat(), limit that to CAP_DAC_READ_SEARCH Allow passing O_PATH descriptors via SCM_RIGHTS datagrams readlinkat(), fchownat() and fstatat() with empty relative pathnames Allow O_PATH for symlinks New kind of open files - "location only". ext4: Copy fs UUID to superblock ext3: Copy fs UUID to superblock. vfs: Export file system uuid via /proc/<pid>/mountinfo unistd.h: Add new syscalls numbers to asm-generic x86: Add new syscalls for x86_64 x86: Add new syscalls for x86_32 fs: Remove i_nlink check from file system link callback fs: Don't allow to create hardlink for deleted file vfs: Add open by file handle support ...	2011-03-15 15:48:13 -07:00
James Morris	a002951c97	Merge branch 'next' into for-linus	2011-03-16 09:41:17 +11:00
David S. Miller	30df754ded	Merge branch 'irq/numa' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip	2011-03-15 15:06:35 -07:00
Linus Torvalds	397fae0818	Merge branches 'stable/irq.rework' and 'stable/pcifront-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/irq.rework' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/irq: Cleanup up the pirq_to_irq for DomU PV PCI passthrough guests as well. xen: Use IRQF_FORCE_RESUME xen/timer: Missing IRQF_NO_SUSPEND in timer code broke suspend. xen: Fix compile error introduced by "switch to new irq_chip functions" xen: Switch to new irq_chip functions xen: Remove stale irq_chip.end xen: events: do not free legacy IRQs xen: events: allocate GSIs and dynamic IRQs from separate IRQ ranges. xen: events: add xen_allocate_irq_{dynamic, gsi} and xen_free_irq xen:events: move find_unbound_irq inside CONFIG_PCI_MSI xen: handled remapped IRQs when enabling a pcifront PCI device. genirq: Add IRQF_FORCE_RESUME * 'stable/pcifront-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: pci/xen: When free-ing MSI-X/MSI irq->desc also use generic code. pci/xen: Cleanup: convert int** to int[] pci/xen: Use xen_allocate_pirq_msi instead of xen_allocate_pirq xen-pcifront: Sanity check the MSI/MSI-X values xen-pcifront: don't use flush_scheduled_work()	2011-03-15 10:47:16 -07:00
Aneesh Kumar K.V	becfd1f375	vfs: Add open by file handle support [AV: duplicate of open() guts removed; file_open_root() used instead] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:44 -04:00
Aneesh Kumar K.V	990d6c2d7a	vfs: Add name to file handle conversion support The syscall also returns a mount id, which can be used to look up file system specific information such as uuid in /proc/<pid>/mountinfo Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:37 -04:00

12837 Commits