linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-30 16:11:38 +00:00

Author	SHA1	Message	Date
Frederic Weisbecker	00f57f545a	tracing/function-graph-tracer: fix a regression while suspend to disk Impact: fix a crash while kernel image restore When the function graph tracer is running and while suspend to disk, some racy and dangerous things happen against this tracer. The current task will save its registers including the stack pointer which contains the return address hooked by the tracer. But the current task will continue to enter other functions after that to save the memory, and then it will store other return addresses, and finally loose the old depth which matches the return address saved in the old stack (during the registers saving). So on image restore, the code will return to wrong addresses. And there are other things: on restore, the task will have it's "current" pointer overwritten during registers restoring....switching from one task to another... That would be insane to try to trace function graphs at these stages. This patch makes the function graph tracer listening on power events, making it's tracing disabled for the current task (the one that performs the hibernation work) while suspend/resume to disk, making the tracing safe during hibernation. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-21 15:21:30 +01:00
Paul Mundt	0609697eab	dma-coherent: Restore dma_alloc_from_coherent() large alloc fall back policy. When doing large allocations (larger than the per-device coherent area) the generic memory allocators are silently fallen back on regardless of consideration for the per-device constraints. In the DMA_MEMORY_EXCLUSIVE case falling back on generic memory is not an option, as it tends not to be addressable by the DMA hardware in question. This issue showed up with the 8139too breakage on the Dreamcast, where non-addressable buffers were silently allocated due to the size mismatch calculation -- while it should have simply errored out upon being unable to satisfy the allocation with the given device constraints. This restores fall back behaviour to what it was before the oversized request change caused multiple regressions. Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2009-01-21 18:51:53 +09:00
Adrian McMenamin	cdf57cab27	dma-coherent: per-device coherent area is in pages, not bytes. Commit `58c6d3dfe4` ("dma-coherent: catch oversized requests to dma_alloc_from_coherent()") attempted to add a sanity check to bail out on allocations larger than the coherent area. Unfortunately when this was implemented, the fact the coherent area is tracked in pages rather than bytes was overlooked, which subsequently broke every single dma_alloc_from_coherent() user, forcing the allocation silently through generic memory instead. Signed-off-by: Adrian McMenamin <adrian@mcmen.demon.co.uk > Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2009-01-21 18:47:38 +09:00
Steven Rostedt	082605de5f	ring-buffer: fix alignment problem Impact: fix to allow some archs to use the ring buffer Commits in the ring buffer are checked by pointer arithmetic. If the calculation is incorrect, then the commits will never take place and the buffer will simply fill up and report an error. Each page in the ring buffer has a small header: struct buffer_data_page { u64 time_stamp; local_t commit; unsigned char data[]; }; Unfortuntely, some of the calculations used sizeof(struct buffer_data_page) to know the size of the header. But this is incorrect on some archs, where sizeof(struct buffer_data_page) does not equal offsetof(struct buffer_data_page, data), and on those archs, the commits are never processed. This patch replaces the sizeof with offsetof. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-20 13:09:06 +01:00
Rusty Russell	8ccad40df8	work_on_cpu: Use our own workqueue. Impact: remove potential clashes with generic kevent workqueue Annoyingly, some places we want to use work_on_cpu are already in workqueues. As per Ingo's suggestion, we create a different workqueue for work_on_cpu. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-19 22:36:07 +01:00
Rusty Russell	31ad908120	work_on_cpu: don't try to get_online_cpus() in work_on_cpu. Impact: remove potential circular lock dependency with cpu hotplug lock This has caused more problems than it solved, with a pile of cpu hotplug locking issues. Followup patches will get_online_cpus() in callers that need it, but if they don't do it they're no worse than before when they were using set_cpus_allowed without locking. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-19 22:36:02 +01:00
Miao Xie	f90d4118ba	cpuset: fix possible deadlock in async_rebuild_sched_domains Lockdep reported some possible circular locking info when we tested cpuset on NUMA/fake NUMA box. ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.29-rc1-00224-ga652504 #111 ------------------------------------------------------- bash/2968 is trying to acquire lock: (events){--..}, at: [<ffffffff8024c8cd>] flush_work+0x24/0xd8 but task is already holding lock: (cgroup_mutex){--..}, at: [<ffffffff8026ad1e>] cgroup_lock_live_group+0x12/0x29 which lock already depends on the new lock. ...... ------------------------------------------------------- Steps to reproduce: # mkdir /dev/cpuset # mount -t cpuset xxx /dev/cpuset # mkdir /dev/cpuset/0 # echo 0 > /dev/cpuset/0/cpus # echo 0 > /dev/cpuset/0/mems # echo 1 > /dev/cpuset/0/memory_migrate # cat /dev/zero > /dev/null & # echo $! > /dev/cpuset/0/tasks This is because async_rebuild_sched_domains has the following lock sequence: run_workqueue(async_rebuild_sched_domains) -> do_rebuild_sched_domains -> cgroup_lock But, attaching tasks when memory_migrate is set has following: cgroup_lock_live_group(cgroup_tasks_write) -> do_migrate_pages -> flush_work This patch fixes it by using a separate workqueue thread. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-19 02:44:00 +01:00
Peter Zijlstra	1d4a7f1c4f	hrtimers: fix inconsistent lock state on resume in hres_timers_resume Andrey Borzenkov reported this lockdep assert: > [17854.688347] ================================= > [17854.688347] [ INFO: inconsistent lock state ] > [17854.688347] 2.6.29-rc2-1avb #1 > [17854.688347] --------------------------------- > [17854.688347] inconsistent {in-hardirq-W} -> {hardirq-on-W} usage. > [17854.688347] pm-suspend/18240 [HC0[0]:SC0[0]:HE1:SE1] takes: > [17854.688347] (&cpu_base->lock){++..}, at: [<c0136fcc>] retrigger_next_event+0x5c/0xa0 > [17854.688347] {in-hardirq-W} state was registered at: > [17854.688347] [<c01443cd>] __lock_acquire+0x79d/0x1930 > [17854.688347] [<c01455bc>] lock_acquire+0x5c/0x80 > [17854.688347] [<c03092e5>] _spin_lock+0x35/0x70 > [17854.688347] [<c0136e61>] hrtimer_run_queues+0x31/0x140 > [17854.688347] [<c0128d98>] run_local_timers+0x8/0x20 > [17854.688347] [<c0128dd3>] update_process_times+0x23/0x60 > [17854.688347] [<c013e274>] tick_periodic+0x24/0x80 > [17854.688347] [<c013e2e2>] tick_handle_periodic+0x12/0x70 > [17854.688347] [<c0104e24>] timer_interrupt+0x14/0x20 > [17854.688347] [<c01607b9>] handle_IRQ_event+0x29/0x60 > [17854.688347] [<c0161c59>] handle_level_irq+0x69/0xe0 > [17854.688347] [<ffffffff>] 0xffffffff > [17854.688347] irq event stamp: 55771 > [17854.688347] hardirqs last enabled at (55771): [<c0309125>] _spin_unlock_irqrestore+0x35/0x60 > [17854.688347] hardirqs last disabled at (55770): [<c0309419>] _spin_lock_irqsave+0x19/0x80 > [17854.688347] softirqs last enabled at (54836): [<c0124f54>] __do_softirq+0xc4/0x110 > [17854.688347] softirqs last disabled at (54831): [<c01049ae>] do_softirq+0x8e/0xe0 > [17854.688347] > [17854.688347] other info that might help us debug this: > [17854.688347] 3 locks held by pm-suspend/18240: > [17854.688347] #0: (&buffer->mutex){--..}, at: [<c01dd4c5>] sysfs_write_file+0x25/0x100 > [17854.688347] #1: (pm_mutex){--..}, at: [<c015056f>] enter_state+0x4f/0x140 > [17854.688347] #2: (dpm_list_mtx){--..}, at: [<c027880f>] device_pm_lock+0xf/0x20 > [17854.688347] > [17854.688347] stack backtrace: > [17854.688347] Pid: 18240, comm: pm-suspend Not tainted 2.6.29-rc2-1avb #1 > [17854.688347] Call Trace: > [17854.688347] [<c0306248>] ? printk+0x18/0x20 > [17854.688347] [<c0141fac>] print_usage_bug+0x16c/0x1d0 > [17854.688347] [<c0142bcf>] mark_lock+0x8bf/0xc90 > [17854.688347] [<c0106b8f>] ? pit_next_event+0x2f/0x40 > [17854.688347] [<c01441b0>] __lock_acquire+0x580/0x1930 > [17854.688347] [<c030916d>] ? _spin_unlock+0x1d/0x20 > [17854.688347] [<c0106b8f>] ? pit_next_event+0x2f/0x40 > [17854.688347] [<c013dd38>] ? clockevents_program_event+0x98/0x160 > [17854.688347] [<c0142fe8>] ? mark_held_locks+0x48/0x90 > [17854.688347] [<c0309125>] ? _spin_unlock_irqrestore+0x35/0x60 > [17854.688347] [<c0143229>] ? trace_hardirqs_on_caller+0x139/0x190 > [17854.688347] [<c014328b>] ? trace_hardirqs_on+0xb/0x10 > [17854.688347] [<c01455bc>] lock_acquire+0x5c/0x80 > [17854.688347] [<c0136fcc>] ? retrigger_next_event+0x5c/0xa0 > [17854.688347] [<c03092e5>] _spin_lock+0x35/0x70 > [17854.688347] [<c0136fcc>] ? retrigger_next_event+0x5c/0xa0 > [17854.688347] [<c0136fcc>] retrigger_next_event+0x5c/0xa0 > [17854.688347] [<c013711a>] hres_timers_resume+0xa/0x10 > [17854.688347] [<c013aa8e>] timekeeping_resume+0xee/0x150 > [17854.688347] [<c0273384>] __sysdev_resume+0x14/0x50 > [17854.688347] [<c0273407>] sysdev_resume+0x47/0x80 > [17854.688347] [<c02791ab>] device_power_up+0xb/0x20 > [17854.688347] [<c015043f>] suspend_devices_and_enter+0xcf/0x150 > [17854.688347] [<c0150c2f>] ? freeze_processes+0x3f/0x90 > [17854.688347] [<c0150614>] enter_state+0xf4/0x140 > [17854.688347] [<c01506dd>] state_store+0x7d/0xc0 > [17854.688347] [<c0150660>] ? state_store+0x0/0xc0 > [17854.688347] [<c0202da4>] kobj_attr_store+0x24/0x30 > [17854.688347] [<c01dd53c>] sysfs_write_file+0x9c/0x100 > [17854.688347] [<c019916c>] vfs_write+0x9c/0x160 > [17854.688347] [<c0103494>] ? restore_nocheck_notrace+0x0/0xe > [17854.688347] [<c01dd4a0>] ? sysfs_write_file+0x0/0x100 > [17854.688347] [<c01992ed>] sys_write+0x3d/0x70 > [17854.688347] [<c0103371>] sysenter_do_call+0x12/0x31 Andrey's analysis: > timekeeping_resume() is called via class ->resume > method; and according to comments in sysdev_resume() and > device_power_up(), they are called with interrupts disabled. > > Looking at suspend_enter, irqs are disabled at this point. > > So it actually looks like something (may be some driver) > unconditionally enabled irqs in resume path. Add a debug check to test this theory. If it triggers then it triggers because the resume code calls it with irqs enabled, which is a no-no not just for timekeeping_resume(), but also bad for a number of other resume handlers. Reported-by: Andrey Borzenkov <arvidjaar@mail.ru> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-18 21:31:37 +01:00
Jiri Slaby	b786c6a98e	relay: fix lock imbalance in relay_late_setup_files One fail path in relay_late_setup_files() omits mutex_unlock(&relay_channels_mutex); Add it. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-18 20:29:35 +01:00
Rafael J. Wysocki	091d71e023	PM: Fix compilation warning in kernel/power/main.c Reorder the code in kernel/power/main.c to fix compilation warning triggered by unsetting CONFIG_SUSPEND. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Len Brown <len.brown@intel.com>	2009-01-16 18:13:41 -05:00
Len Brown	88d998c264	Merge branch 'misc' into release	2009-01-16 14:45:34 -05:00
Masami Hiramatsu	5a4ccaf37f	kprobes: check CONFIG_FREEZER instead of CONFIG_PM Check CONFIG_FREEZER instead of CONFIG_PM because kprobe booster depends on freeze_processes() and thaw_processes() when CONFIG_PREEMPT=y. This fixes a linkage error which occurs when CONFIG_PREEMPT=y, CONFIG_PM=y and CONFIG_FREEZER=n. Reported-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Len Brown <len.brown@intel.com>	2009-01-16 14:32:17 -05:00
Rafael J. Wysocki	33f1d7ecc6	PM: Fix freezer compilation if PM_SLEEP is unset Freezer fails to compile if with the following configuration settings: CONFIG_CGROUPS=y CONFIG_CGROUP_FREEZER=y CONFIG_MODULES=y CONFIG_FREEZER=y CONFIG_PM=y CONFIG_PM_SLEEP=n Fix this by making process.o compilation depend on CONFIG_FREEZER. Reported-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Len Brown <len.brown@intel.com>	2009-01-16 14:32:17 -05:00
Linus Torvalds	7cb36b6ccd	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: sched_slice() fixlet sched: fix update_min_vruntime sched: SCHED_OTHER vs SCHED_IDLE isolation sched: SCHED_IDLE weight change sched: fix bandwidth validation for UID grouping Revert "sched: improve preempt debugging"	2009-01-15 16:55:00 -08:00
Randy Dunlap	6ae301e85c	resources: fix parameter name and kernel-doc Fix __request_region() parameter kernel-doc notation and parameter name: Warning(linux-2.6.28-git10//kernel/resource.c:627): No description found for parameter 'flags' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-15 16:39:38 -08:00
Li Zefan	45ce80fb6b	cgroups: consolidate cgroup documents Move Documentation/cpusets.txt and Documentation/controllers/* to Documentation/cgroups/ Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-15 16:39:37 -08:00
Lin Ming	6272d68cc6	sched: sched_slice() fixlet Mike's change: `0a582440f` "sched: fix sched_slice())" broke group scheduling by forgetting to reload cfs_rq on each loop. This patch fixes aim7 regression and specjbb2005 regression becomes less than 1.5% on 8-core stokley. Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Jayson King <dev@jaysonking.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-15 21:07:57 +01:00
Doug Chapman	88fc241f54	[IA64] dump stack on kernel unaligned warnings Often the cause of kernel unaligned access warnings is not obvious from just the ip displayed in the warning. This adds the option via proc to dump the stack in addition to the warning. The default is off (just display the 1 line warning). To enable the stack to be shown: echo 1 > /proc/sys/kernel/unaligned-dump-stack Signed-off-by: Doug Chapman <doug.chapman@hp.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2009-01-15 10:38:56 -08:00
Peter Zijlstra	e17036dac1	sched: fix update_min_vruntime Impact: fix SCHED_IDLE latency problems OK, so we have 1 running task A (which is obviously curr and the tree is equally obviously empty). 'A' nicely chugs along, doing its thing, carrying min_vruntime along as it goes. Then some whacko speed freak SCHED_IDLE task gets inserted due to SMP balancing, which is very likely far right, in that case update_curr update_min_vruntime cfs_rq->rb_leftmost := true (the crazy task sitting in a tree) vruntime = se->vruntime and voila, min_vruntime is waaay right of where it ought to be. OK, so why did I write it like that to begin with... Aah, yes. Say we've just dequeued current schedule deactivate_task(prev) dequeue_entity update_min_vruntime Then we'll set vruntime = cfs_rq->min_vruntime; we find !cfs_rq->curr, but do find someone in the tree. Then we _must_ do vruntime = se->vruntime, because vruntime = min_vruntime(vruntime := cfs_rq->min_vruntime, se->vruntime) will not advance vruntime, and cause lags the other way around (which we fixed with that initial patch: `1af5f730fc` (sched: more accurate min_vruntime accounting). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Mike Galbraith <efault@gmx.de> Acked-by: Mike Galbraith <efault@gmx.de> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-15 15:12:19 +01:00
Peter Zijlstra	6bc912b71b	sched: SCHED_OTHER vs SCHED_IDLE isolation Stronger SCHED_IDLE isolation: - no SCHED_IDLE buddies - never let SCHED_IDLE preempt on wakeup - always preempt SCHED_IDLE on wakeup - limit SLEEPER fairness for SCHED_IDLE. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-15 15:07:29 +01:00
Peter Zijlstra	cce7ade803	sched: SCHED_IDLE weight change Increase the SCHED_IDLE weight from 2 to 3, this gives much more stable vruntime numbers. time advanced in 100ms: weight=2 64765.988352 67012.881408 88501.412352 weight=3 35496.181411 34130.971298 35497.411573 Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-15 15:07:28 +01:00
Peter Zijlstra	98a4826b99	sched: fix bandwidth validation for UID grouping Impact: make rt-limit tunables work again Mark Glines reported: > I've got an issue on x86-64 where I can't configure the system to allow > RT tasks for a non-root user. > > In 2.6.26.5, I was able to do the following to set things up nicely: > echo 450000 >/sys/kernel/uids/0/cpu_rt_runtime > echo 450000 >/sys/kernel/uids/1000/cpu_rt_runtime > > Seems like every value I try to echo into the /sys files returns EINVAL. For UID grouping we initialize the root group with infinite bandwidth which by default is actually more than the global limit, therefore the bandwidth check always fails. Because the root group is a phantom group (for UID grouping) we cannot runtime adjust it, therefore we let it reflect the global bandwidth settings. Reported-by: Mark Glines <mark@glines.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-15 15:07:27 +01:00
Jaswinder Singh Rajput	934d96eafa	time-sched.c: tick_nohz_update_jiffies should be static Impact: cleanup, reduce kernel size a bit, avoid sparse warning Fixes sparse warning: kernel/time/tick-sched.c:137:6: warning: symbol 'tick_nohz_update_jiffies' was not declared. Should it be static? Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-15 12:06:56 +01:00
Linus Torvalds	bca268565f	Merge branch 'syscalls' of git://git390.osdl.marist.edu/pub/scm/linux-2.6 * 'syscalls' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (44 commits) [CVE-2009-0029] s390 specific system call wrappers [CVE-2009-0029] System call wrappers part 33 [CVE-2009-0029] System call wrappers part 32 [CVE-2009-0029] System call wrappers part 31 [CVE-2009-0029] System call wrappers part 30 [CVE-2009-0029] System call wrappers part 29 [CVE-2009-0029] System call wrappers part 28 [CVE-2009-0029] System call wrappers part 27 [CVE-2009-0029] System call wrappers part 26 [CVE-2009-0029] System call wrappers part 25 [CVE-2009-0029] System call wrappers part 24 [CVE-2009-0029] System call wrappers part 23 [CVE-2009-0029] System call wrappers part 22 [CVE-2009-0029] System call wrappers part 21 [CVE-2009-0029] System call wrappers part 20 [CVE-2009-0029] System call wrappers part 19 [CVE-2009-0029] System call wrappers part 18 [CVE-2009-0029] System call wrappers part 17 [CVE-2009-0029] System call wrappers part 16 [CVE-2009-0029] System call wrappers part 15 ...	2009-01-14 19:58:40 -08:00
Sam Ravnborg	2ea038917b	Revert "kbuild: strip generated symbols from *.ko" This reverts commit `ad7a953c52`. And commit: ("allow stripping of generated symbols under CONFIG_KALLSYMS_ALL") `9bb482476c` These stripping patches has caused a set of issues: 1) People have reported compatibility issues with binutils due to lack of support for `--strip-unneeded-symbols' with objcopy 2.15.92.0.2 Reported by: Wenji 2) ccache and distcc no longer works as expeced Reported by: Ted, Roland, + others 3) The installed modules increased a lot in size Reported by: Ted, Davej + others Reported-by: Wenji Huang <wenji.huang@oracle.com> Reported-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Dave Jones <davej@redhat.com> Reported-by: Roland McGrath <roland@redhat.com> Signed-off-by: Sam Ravnborg <sam@ravnborg.org>	2009-01-14 21:38:20 +01:00
Andrew Morton	9316fcacb8	kernel/up.c: omit it if SMP=y, USE_GENERIC_SMP_HELPERS=n Fix the sparc build - we were including `up.o' on SMP builds, when CONFIG_USE_GENERIC_SMP_HELPERS=n. Tested-by: Robert Reif <reif@earthlink.net> Fixed-by: Robert Reif <reif@earthlink.net> Cc: David Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-14 09:42:11 -08:00
Heiko Carstens	d4e82042c4	[CVE-2009-0029] System call wrappers part 32 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:31 +01:00
Heiko Carstens	836f92adf1	[CVE-2009-0029] System call wrappers part 31 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:31 +01:00
Heiko Carstens	6559eed8ca	[CVE-2009-0029] System call wrappers part 30 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:30 +01:00
Heiko Carstens	1e7bfb2134	[CVE-2009-0029] System call wrappers part 27 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:29 +01:00
Heiko Carstens	c4ea37c26a	[CVE-2009-0029] System call wrappers part 26 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:29 +01:00
Heiko Carstens	e48fbb699f	[CVE-2009-0029] System call wrappers part 24 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:28 +01:00
Heiko Carstens	5a8a82b1d3	[CVE-2009-0029] System call wrappers part 23 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:28 +01:00
Heiko Carstens	003d7ab479	[CVE-2009-0029] System call wrappers part 19 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:26 +01:00
Heiko Carstens	a6b42e83f2	[CVE-2009-0029] System call wrappers part 18 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:25 +01:00
Heiko Carstens	ca013e945b	[CVE-2009-0029] System call wrappers part 17 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:25 +01:00
Heiko Carstens	a5f8fa9e9b	[CVE-2009-0029] System call wrappers part 09 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:21 +01:00
Heiko Carstens	17da2bd90a	[CVE-2009-0029] System call wrappers part 08 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:21 +01:00
Heiko Carstens	754fe8d297	[CVE-2009-0029] System call wrappers part 07 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:20 +01:00
Heiko Carstens	5add95d4f7	[CVE-2009-0029] System call wrappers part 06 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:20 +01:00
Heiko Carstens	362e9c07c7	[CVE-2009-0029] System call wrappers part 05 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:20 +01:00
Heiko Carstens	b290ebe2c4	[CVE-2009-0029] System call wrappers part 04 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:19 +01:00
Heiko Carstens	ae1251ab78	[CVE-2009-0029] System call wrappers part 03 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:19 +01:00
Heiko Carstens	dbf040d9d1	[CVE-2009-0029] System call wrappers part 02 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:19 +01:00
Heiko Carstens	58fd3aa288	[CVE-2009-0029] System call wrappers part 01 Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:18 +01:00
Heiko Carstens	f627a741d2	[CVE-2009-0029] Make sys_syslog a conditional system call Remove the -ENOSYS implementation for !CONFIG_PRINTK and use the cond_syscall infrastructure instead. Acked-by: Kyle McMartin <kyle@redhat.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:16 +01:00
Heiko Carstens	2ed7c03ec1	[CVE-2009-0029] Convert all system calls to return a long Convert all system calls to return a long. This should be a NOP since all converted types should have the same size anyway. With the exception of sys_exit_group which returned void. But that doesn't matter since the system call doesn't return. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2009-01-14 14:15:14 +01:00
Ingo Molnar	14819ea1e0	irq: export __set_irq_handler() and handle_level_irq() Impact: build fix ARM updates broke x86 allmodconfig builds: ERROR: "__set_irq_handler" [drivers/mfd/pcf50633-core.ko] undefined! ERROR: "handle_level_irq" [drivers/mfd/pcf50633-core.ko] undefined! Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-14 12:34:21 +01:00
Mandeep Singh Baines	baf48f6577	softlock: fix false panic which can occur if softlockup_thresh is reduced At run-time, if softlockup_thresh is changed to a much lower value, touch_timestamp is likely to be much older than the new softlock_thresh. This will cause a false softlockup to be detected. If softlockup_panic is enabled, the system will panic. The fix is to touch all watchdogs before changing softlockup_thresh. Signed-off-by: Mandeep Singh Baines <msb@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-14 11:48:07 +01:00
Lai Jiangshan	e4fa4c9701	rcu: add __cpuinit to rcu_init_percpu_data() Impact: reduce memory footprint add __cpuinit to rcu_init_percpu_data(), and this function's text will be discarded after boot when !CONFIG_HOTPLUG_CPU. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-14 11:26:40 +01:00
Linus Torvalds	28839855bf	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: smp_call_function_single(): be slightly less stupid, fix #2 lockdep, mm: fix might_fault() annotation	2009-01-13 09:02:21 -08:00
Arjan van de Ven	37a76bd4f1	async: fix __lowest_in_progress() At 37000 feet somewhere near Greenland I woke up from a half-sleep with the realisation that __lowest_in_progress() is buggy. After landing I checked and there were indeed 2 problems with it; this patch fixes both: * The order of the list checks was wrong * The locking was not correct. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-12 16:39:58 -08:00
Linus Torvalds	12847095e9	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: kernel/sched.c: add missing forward declaration for 'double_rq_lock' sched: partly revert "sched debug: remove NULL checking in print_cfs_rt_rq()" cpumask: fix CONFIG_NUMA=y sched.c	2009-01-12 16:29:00 -08:00
Ingo Molnar	6e96281412	smp_call_function_single(): be slightly less stupid, fix #2 fix m68k build failure: tip/kernel/up.c: In function 'smp_call_function_single': tip/kernel/up.c:16: error: dereferencing pointer to incomplete type make[2]: *** [kernel/up.o] Error 1 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-12 16:04:37 +01:00
Ingo Molnar	01e3eb8227	Revert "sched: improve preempt debugging" This reverts commit `7317d7b87e`. This has been reported (and bisected) by Alexey Zaytsev and Kamalesh Babulal to produce annoying warnings during bootup on both x86 and powerpc. kernel_locked() is not a valid test in IRQ context (we update the BKL's ->lock_depth and the preempt count separately and non-atomicalyy), so we cannot put it into the generic preempt debugging checks which can run in IRQ contexts too. Reported-and-bisected-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Reported-and-bisected-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-12 13:00:50 +01:00
Steven Noonan	783adf42cf	kernel/fork.c: unused variable 'ret' Removed the unused variable. Signed-off-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-11 15:34:10 +01:00
Ingo Molnar	d19b85db9d	Merge commit 'v2.6.29-rc1' into timers/urgent	2009-01-11 15:34:05 +01:00
Steven Noonan	fd2ab30b65	kernel/sched.c: add missing forward declaration for 'double_rq_lock' Impact: build fix on certain configs Added 'double_rq_lock' forward declaration, allowing double_rq_lock to be used in _double_lock_balance(). Signed-off-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-11 13:06:07 +01:00
Ingo Molnar	93423b8665	smp_call_function_single(): be slightly less stupid, fix Impact: build fix on Alpha kernel/up.c: In function 'smp_call_function_single': kernel/up.c:12: error: 'cpuid' undeclared (first use in this function) kernel/up.c:12: error: (Each undeclared identifier is reported only once kernel/up.c:12: error: for each function it appears in.) The typo didnt show up on x86 because 'cpuid' happens to be a function address as well ... Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-11 05:15:21 +01:00
Andrew Morton	53ce3d9564	smp_call_function_single(): be slightly less stupid If you do smp_call_function_single(expression-with-side-effects, ...) then expression-with-side-effects never gets evaluated on UP builds. As always, implementing it in C is the correct thing to do. While we're there, uninline it for size and possible header dependency reasons. And create a new kernel/up.c, as a place in which to put uniprocessor-specific code and storage. It should mirror kernel/smp.c. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-11 03:41:58 +01:00
Ingo Molnar	abede81c4f	Merge commit 'v2.6.29-rc1' into core/urgent	2009-01-11 03:41:39 +01:00
Li Zefan	805194c35b	sched: partly revert "sched debug: remove NULL checking in print_cfs_rt_rq()" Impact: avoid accessing NULL tg.css->cgroup In commit `0a0db8f5c9`, I removed checking NULL tg.css->cgroup, but I realized I was wrong when I found reading /proc/sched_debug can race with cgroup_create(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-11 02:40:32 +01:00
Rusty Russell	62ea9ceb17	cpumask: fix CONFIG_NUMA=y sched.c Impact: fix panic on ia64 with NR_CPUS=1024 struct sched_domain is now a dangling structure; where we really want static ones, we need to use static_sched_domain. (As the FIXME in this file says, cpumask_var_t would be better, but this code is hairy enough without trying to add initialization code to the right places). Reported-by: Mike Travis <travis@sgi.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-11 01:04:16 +01:00
Linus Torvalds	9a100a4464	Merge git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-2 * git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-2: async: make async a command line option for now partial revert of asynchronous inode delete	2009-01-09 15:32:26 -08:00
Linus Torvalds	c40f6f8bbc	Merge git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu * git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu: NOMMU: Support XIP on initramfs NOMMU: Teach kobjsize() about VMA regions. FLAT: Don't attempt to expand the userspace stack to fill the space allocated FDPIC: Don't attempt to expand the userspace stack to fill the space allocated NOMMU: Improve procfs output using per-MM VMAs NOMMU: Make mmap allocation page trimming behaviour configurable. NOMMU: Make VMAs per MM as for MMU-mode linux NOMMU: Delete askedalloc and realalloc variables NOMMU: Rename ARM's struct vm_region NOMMU: Fix cleanup handling in ramfs_nommu_get_umapped_area()	2009-01-09 14:00:58 -08:00
Linus Torvalds	1a7d0f0bec	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: CRED: Fix commit_creds() on a process that has no mm	2009-01-09 13:59:25 -08:00
Arjan van de Ven	cdb80f630b	async: make async a command line option for now ... and have it default off. This does allow people to work with it for testing. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2009-01-09 13:23:45 -08:00
Linus Torvalds	4ce5f24193	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile: (31 commits) powerpc/oprofile: fix whitespaces in op_model_cell.c powerpc/oprofile: IBM CELL: add SPU event profiling support powerpc/oprofile: fix cell/pr_util.h powerpc/oprofile: IBM CELL: cleanup and restructuring oprofile: make new cpu buffer functions part of the api oprofile: remove #ifdef CONFIG_OPROFILE_IBS in non-ibs code ring_buffer: fix ring_buffer_event_length() oprofile: use new data sample format for ibs oprofile: add op_cpu_buffer_get_data() oprofile: add op_cpu_buffer_add_data() oprofile: rework implementation of cpu buffer events oprofile: modify op_cpu_buffer_read_entry() oprofile: add op_cpu_buffer_write_reserve() oprofile: rename variables in add_ibs_begin() oprofile: rename add_sample() in cpu_buffer.c oprofile: rename variable ibs_allowed to has_ibs in op_model_amd.c oprofile: making add_sample_entry() inline oprofile: remove backtrace code for ibs oprofile: remove unused ibs macro oprofile: remove unused components in struct oprofile_cpu_buffer ...	2009-01-09 12:43:06 -08:00
Linus Torvalds	a3a798c88a	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (94 commits) ACPICA: hide private headers ACPICA: create acpica/ directory ACPI: fix build warning ACPI : Use RSDT instead of XSDT by adding boot option of "acpi=rsdt" ACPI: Avoid array address overflow when _CST MWAIT hint bits are set fujitsu-laptop: Simplify SBLL/SBL2 backlight handling fujitsu-laptop: Add BL power, LED control and radio state information ACPICA: delete utcache.c ACPICA: delete acdisasm.h ACPICA: Update version to 20081204. ACPICA: FADT: Update error msgs for consistency ACPICA: FADT: set acpi_gbl_use_default_register_widths to TRUE by default ACPICA: FADT parsing changes and fixes ACPICA: Add ACPI_MUTEX_TYPE configuration option ACPICA: Fixes for various ACPI data tables ACPICA: Restructure includes into public/private ACPI: remove private acpica headers from driver files ACPI: reboot.c: use new acpi_reset interface ACPICA: New: acpi_reset interface - write to reset register ACPICA: Move all public H/W interfaces to new hwxface ...	2009-01-09 11:55:14 -08:00
David Howells	43529c9712	CRED: Must initialise the new creds in prepare_kernel_cred() The newly allocated creds in prepare_kernel_cred() must be initialised before get_uid() and get_group_info() can access them. They should be copied from the old credentials. Reported-by: Steve Dickson <steved@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-09 11:53:53 -08:00
David Howells	0de3368141	CRED: Missing put_cred() in prepare_kernel_cred() Missing put_cred() in the error handling path of prepare_kernel_cred(). Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-09 11:53:53 -08:00
Len Brown	b2576e1d44	Merge branch 'linus' into release	2009-01-09 03:39:43 -05:00
Len Brown	3cc8a5f4ba	Merge branch 'suspend' into release	2009-01-09 03:38:15 -05:00
Arjan van de Ven	33b04b9308	async: make async_synchronize_full() more serializing turns out that there are real problems with allowing async tasks that are scheduled from async tasks to run after the async_synchronize_full() returns. This patch makes the _full more strict and a complete synchronization. Later I might need to add back a lighter form of synchronization for other uses.. but not right now. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 12:58:09 -08:00
Wu Fengguang	df4927bf6c	generic swap(): sched: remove local swap() macro Use the new generic implementation. Signed-off-by: Wu Fengguang <wfg@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:15 -08:00
Eric W. Biederman	61bce0f137	pid: generalize task_active_pid_ns Currently task_active_pid_ns is not safe to call after a task becomes a zombie and exit_task_namespaces is called, as nsproxy becomes NULL. By reading the pid namespace from the pid of the task we can trivially solve this problem at the cost of one extra memory read in what should be the same cacheline as we read the namespace from. When moving things around I have made task_active_pid_ns out of line because keeping it in pid_namespace.h would require adding includes of pid.h and sched.h that I don't think we want. This change does make task_active_pid_ns unsafe to call during copy_process until we attach a pid on the task_struct which seems to be a reasonable trade off. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Bastian Blank <bastian@waldi.eu.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:12 -08:00
Li Zefan	6af866af34	cpuset: remove remaining pointers to cpumask_t Impact: cleanups, use new cpumask API Final trivial cleanups: mainly s/cpumask_t/struct cpumask Note there is a FIXME in generate_sched_domains(). A future patch will change struct cpumask doms to struct cpumask doms[]. (I suppose Rusty will do this.) Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Mike Travis <travis@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:11 -08:00
Li Zefan	300ed6cbb7	cpuset: convert cpuset->cpus_allowed to cpumask_var_t Impact: use new cpumask API This patch mainly does the following things: - change cs->cpus_allowed from cpumask_t to cpumask_var_t - call alloc_bootmem_cpumask_var() for top_cpuset in cpuset_init_early() - call alloc_cpumask_var() for other cpusets - replace cpus_xxx() to cpumask_xxx() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Mike Travis <travis@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:11 -08:00
Li Zefan	645fcc9d2f	cpuset: don't allocate trial cpuset on stack Impact: cleanups, reduce stack usage This patch prepares for the next patch. When we convert cpuset.cpus_allowed to cpumask_var_t, (trialcs = cs) no longer works. Another result of this patch is reducing stack usage of trialcs. sizeof(cs) can be as large as 148 bytes on x86_64, so it's really not good to have it on stack. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Mike Travis <travis@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:11 -08:00
Li Zefan	2341d1b659	cpuset: convert cpuset_attach() to use cpumask_var_t Impact: reduce stack usage Allocate a global cpumask_var_t at boot, and use it in cpuset_attach(), so we won't fail cpuset_attach(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Mike Travis <travis@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:11 -08:00
Li Zefan	5771f0a223	cpuset: remove on stack cpumask_t in cpuset_can_attach() Impact: reduce stack usage Just use cs->cpus_allowed, and no need to allocate a cpumask_var_t. Signed-off-by: Li Zefan <lizf@cn.fujistu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Mike Travis <travis@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:11 -08:00
Li Zefan	5a7625df72	cpuset: remove on stack cpumask_t in cpuset_sprintf_cpulist() This patchset converts cpuset to use new cpumask API, and thus remove on stack cpumask_t to reduce stack usage. Before: # cat kernel/cpuset.c include/linux/cpuset.h \| grep -c cpumask_t 21 After: # cat kernel/cpuset.c include/linux/cpuset.h \| grep -c cpumask_t 0 This patch: Impact: reduce stack usage It's safe to call cpulist_scnprintf inside callback_mutex, and thus we can just remove the cpumask_t and no need to allocate a cpumask_var_t. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Mike Travis <travis@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:11 -08:00
Miao Xie	f5813d9427	cpusets: set task's cpu_allowed to cpu_possible_map when attaching it into top cpuset I found a bug on my dual-cpu box. I created a sub cpuset in top cpuset and assign 1 to its cpus. And then we attach some tasks into this sub cpuset. After this, we offline CPU1. Now, the tasks in this new cpuset are moved into top cpuset automatically because there is no cpu in sub cpuset. Then we online CPU1, we find all the tasks which doesn't belong to top cpuset originally just run on CPU0. We fix this bug by setting task's cpu_allowed to cpu_possible_map when attaching it into top cpuset. This method needn't modify the current behavior of cpusets on CPU hotplug, and all of tasks in top cpuset use cpu_possible_map to initialize their cpu_allowed. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:11 -08:00
Lai Jiangshan	13337714f3	cpuset: rcu_read_lock() to protect task_cs() task_cs() calls task_subsys_state(). We must use rcu_read_lock() to protect cgroup_subsys_state(). It's correct that top_cpuset is never freed, but cgroup_subsys_state() accesses css_set, this css_set maybe freed when task_cs() called. We use use rcu_read_lock() to protect it. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:11 -08:00
Paul Menage	e7c5ec9193	cgroups: add css_tryget() Add css_tryget(), that obtains a counted reference on a CSS. It is used in situations where the caller has a "weak" reference to the CSS, i.e. one that does not protect the cgroup from removal via a reference count, but would instead be cleaned up by a destroy() callback. css_tryget() will return true on success, or false if the cgroup is being removed. This is similar to Kamezawa Hiroyuki's patch from a week or two ago, but with the difference that in the event of css_tryget() racing with a cgroup_rmdir(), css_tryget() will only return false if the cgroup really does get removed. This implementation is done by biasing css->refcnt, so that a refcnt of 1 means "releasable" and 0 means "released or releasing". In the event of a race, css_tryget() distinguishes between "released" and "releasing" by checking for the CSS_REMOVED flag in css->flags. Signed-off-by: Paul Menage <menage@google.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:10 -08:00
Paul Menage	999cd8a450	cgroups: add a per-subsystem hierarchy_mutex These patches introduce new locking/refcount support for cgroups to reduce the need for subsystems to call cgroup_lock(). This will ultimately allow the atomicity of cgroup_rmdir() (which was removed recently) to be restored. These three patches give: 1/3 - introduce a per-subsystem hierarchy_mutex which a subsystem can use to prevent changes to its own cgroup tree 2/3 - use hierarchy_mutex in place of calling cgroup_lock() in the memory controller 3/3 - introduce a css_tryget() function similar to the one recently proposed by Kamezawa, but avoiding spurious refcount failures in the event of a race between a css_tryget() and an unsuccessful cgroup_rmdir() Future patches will likely involve: - using hierarchy mutex in place of cgroup_lock() in more subsystems where appropriate - restoring the atomicity of cgroup_rmdir() with respect to cgroup_create() This patch: Add a hierarchy_mutex to the cgroup_subsys object that protects changes to the hierarchy observed by that subsystem. It is taken by the cgroup subsystem (in addition to cgroup_mutex) for the following operations: - linking a cgroup into that subsystem's cgroup tree - unlinking a cgroup from that subsystem's cgroup tree - moving the subsystem to/from a hierarchy (including across the bind() callback) Thus if the subsystem holds its own hierarchy_mutex, it can safely traverse its own hierarchy. Signed-off-by: Paul Menage <menage@google.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:10 -08:00
Balbir Singh	28dbc4b6a0	memcg: memory cgroup resource counters for hierarchy Add support for building hierarchies in resource counters. Cgroups allows us to build a deep hierarchy, but we currently don't link the resource counters belonging to the memory controller control groups, in the same fashion as the corresponding cgroup entries in the cgroup hierarchy. This patch provides the infrastructure for resource counters that have the same hiearchy as their cgroup counter parts. These set of patches are based on the resource counter hiearchy patches posted by Pavel Emelianov. NOTE: Building hiearchies is expensive, deeper hierarchies imply charging the all the way up to the root. It is known that hiearchies are expensive, so the user needs to be careful and aware of the trade-offs before creating very deep ones. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:05 -08:00
Paul Menage	a47295e6bc	cgroups: make cgroup_path() RCU-safe Fix races between /proc/sched_debug by freeing cgroup objects via an RCU callback. Thus any cgroup reference obtained from an RCU-safe source will remain valid during the RCU section. Since dentries are also RCU-safe, this allows us to traverse up the tree safely. Additionally, make cgroup_path() check for a NULL cgrp->dentry to avoid trying to report a path for a partially-created cgroup. [lizf@cn.fujitsu.com: call deactive_super() in cgroup_diput()] Signed-off-by: Paul Menage <menage@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:03 -08:00
Gowrishankar M	e7b80bb695	cgroups: skip processes from other namespaces when listing a cgroup Once tasks are populated from system namespace inside cgroup, container replaces other namespace task with 0 while listing tasks, inside container. Though this is expected behaviour from container end, there is no use of showing unwanted 0s. In this patch, we check if a process is in same namespace before loading into pid array. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Gowrishankar M <gowrishankar.m@in.ibm.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:03 -08:00
Li Zefan	c12f65d439	cgroups: introduce link_css_set() to remove duplicate code Add a common function link_css_set() to link a css_set to a cgroup. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:03 -08:00
Li Zefan	33a68ac1c1	cgroups: add inactive subsystems to rootnode.subsys_list Though for an inactive hierarchy, we have subsys->root == &rootnode, but rootnode's subsys_list is always empty. This conflicts with the code in find_css_set(): for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { ... if (ss->root->subsys_list.next == &ss->sibling) { ... } } if (list_empty(&rootnode.subsys_list)) { ... } The above code assumes rootnode.subsys_list links all inactive hierarchies. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:03 -08:00
Li Zefan	e5f6a8609b	cgroups: make root_list contains active hierarchies only Don't link rootnode to the root list, so root_list contains active hierarchies only as the comment indicates. And rename for_each_root() to for_each_active_root(). Also remove redundant check in cgroup_kill_sb(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:03 -08:00
Lai Jiangshan	7534432dcc	cgroups: remove rcu_read_lock() in cgroupstats_build() cgroup_iter_* do not need rcu_read_lock(). In cgroup_enable_task_cg_lists(), do_each_thread() and while_each_thread() are protected by RCU, it's OK, for write_lock(&css_set_lock) implies rcu_read_lock() in non-RT kernel. If we need explicit rcu_read_lock(), we should add rcu_read_lock() in cgroup_enable_task_cg_lists(), not cgroup_iter_*. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:03 -08:00
Lai Jiangshan	77efecd9e0	cgroups: call find_css_set() safely in cgroup_attach_task() In cgroup_attach_task(), tsk maybe exit when we call find_css_set(). and find_css_set() will access to invalid css_set. This patch increases the count before get_css_set(), and decreases it after find_css_set(). NOTE: css_set's refcount is also taskcount, after this patch applied, taskcount may be off-by-one WHEN cgroup_lock() is not held. but I reviewed other code which use taskcount, they are still correct. No regression found by reviewing and simply testing. So I do not use two counters in css_set. (one counter for taskcount, the other for refcount. like struct mm_struct) If this fix cause regression, we will use two counters in css_set. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:02 -08:00
Lai Jiangshan	104cbd5537	cgroups: use task_lock() for access tsk->cgroups safe in cgroup_clone() Use task_lock() protect tsk->cgroups and get_css_set(tsk->cgroups). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:02 -08:00
Lai Jiangshan	b2aa30f7bb	cgroups: don't put struct cgroupfs_root protected by RCU We don't access struct cgroupfs_root in fast path, so we should not put struct cgroupfs_root protected by RCU But the comment in struct cgroup_subsys.root confuse us. struct cgroup_subsys.root is used in these places: 1 find_css_set(): if (ss->root->subsys_list.next == &ss->sibling) 2 rebind_subsystems(): if (ss->root != &rootnode) rcu_assign_pointer(ss->root, root); rcu_assign_pointer(subsys[i]->root, &rootnode); 3 cgroup_has_css_refs(): if (ss->root != cgrp->root) 4 cgroup_init_subsys(): ss->root = &rootnode; 5 proc_cgroupstats_show(): ss->name, ss->root->subsys_bits, ss->root->number_of_cgroups, !ss->disabled); 6 cgroup_clone(): root = subsys->root; if ((root != subsys->root) \|\| All these place we have held cgroup_lock() or we don't dereference to struct cgroupfs_root. It's means wo don't need RCU when use struct cgroup_subsys.root, and we should not put struct cgroupfs_root protected by RCU. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:02 -08:00
Lai Jiangshan	2019f634ce	cgroups: fix cgroup_iter_next() bug We access res->cgroups without the task_lock(), so res->cgroups may be changed. it's unreliable, and "if (l == &res->cgroups->tasks)" may be false forever. We don't need add any lock for fixing this bug. we just access to struct css_set by struct cg_cgroup_link, not by struct task_struct. Since we hold css_set_lock, struct cg_cgroup_link is reliable. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:02 -08:00
Lai Jiangshan	b12b533fa5	cgroups: add lock for child->cgroups in cgroup_post_fork() When cgroup_post_fork() is called, child is seen by find_task_by_vpid(), so child->cgroups maybe be changed, It'll incorrect. child->cgroups<old>'s refcnt is decreased child->cgroups<new>'s refcnt is increased but child->cg_list is added to child->cgroups<old>'s list. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:02 -08:00
Li Zefan	cae7a366f7	ns_cgroup: remove unused spinlock I happened to find the spinlock in struct ns_cgroup is never used. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:02 -08:00
Li Zefan	75139b8274	cgroups: remove some redundant NULL checks - In cgroup_clone(), if vfs_mkdir() returns successfully, dentry->d_fsdata will be the pointer to the newly created cgroup and won't be NULL. - a cgroup file's dentry->d_fsdata won't be NULL, guaranteed by cgroup_add_file(). - When walking through the subsystems of a cgroup_fs (using for_each_subsys), cgrp->subsys[ss->subsys_id] won't be NULL, guaranteed by cgroup_create(). (Also remove 2 unused variables in cgroup_rmdir(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-08 08:31:01 -08:00
Robert Richter	d2852b932f	Merge branch 'oprofile/ring_buffer' into oprofile/oprofile-for-tip	2009-01-08 14:27:34 +01:00
David Howells	b9456371a7	CRED: Fix commit_creds() on a process that has no mm Fix commit_creds()'s handling of a process that has no mm (such as one that is calling or has called daemonize()). commit_creds() should check to see if task->mm is not NULL before calling set_dumpable() on it. Reported-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-01-08 23:13:56 +11:00
Paul Mundt	dd8632a12e	NOMMU: Make mmap allocation page trimming behaviour configurable. NOMMU mmap allocates a piece of memory for an mmap that's rounded up in size to the nearest power-of-2 number of pages. Currently it then discards the excess pages back to the page allocator, making that memory available for use by other things. This can, however, cause greater amount of fragmentation. To counter this, a sysctl is added in order to fine-tune the trimming behaviour. The default behaviour remains to trim pages aggressively, while this can either be disabled completely or set to a higher page-granular watermark in order to have finer-grained control. vm region vm_top bits taken from an earlier patch by David Howells. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com>	2009-01-08 12:04:47 +00:00
David Howells	8feae13110	NOMMU: Make VMAs per MM as for MMU-mode linux Make VMAs per mm_struct as for MMU-mode linux. This solves two problems: (1) In SYSV SHM where nattch for a segment does not reflect the number of shmat's (and forks) done. (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an exec'ing process when VM_EXECUTABLE is specified, regardless of the fact that a VMA might be shared and already have its vm_mm assigned to another process or a dead process. A new struct (vm_region) is introduced to track a mapped region and to remember the circumstances under which it may be shared and the vm_list_struct structure is discarded as it's no longer required. This patch makes the following additional changes: (1) Regions are now allocated with alloc_pages() rather than kmalloc() and with no recourse to __GFP_COMP, so the pages are not composite. Instead, each page has a reference on it held by the region. Anything else that is interested in such a page will have to get a reference on it to retain it. When the pages are released due to unmapping, each page is passed to put_page() and will be freed when the page usage count reaches zero. (2) Excess pages are trimmed after an allocation as the allocation must be made as a power-of-2 quantity of pages. (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may end up with overlapping VMAs within the tree, the VMA struct address is appended to the sort key. (4) Non-anonymous VMAs are now added to the backing inode's prio list. (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of the backing region. The VMA and region structs will be split if necessary. (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory segment instead of all the attachments at that addresss. Multiple shmat()'s return the same address under NOMMU-mode instead of different virtual addresses as under MMU-mode. (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode. (8) /proc/maps is now the global list of mapped regions, and may list bits that aren't actually mapped anywhere. (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount of RAM currently allocated by mmap to hold mappable regions that can't be mapped directly. These are copies of the backing device or file if not anonymous. These changes make NOMMU mode more similar to MMU mode. The downside is that NOMMU mode requires some extra memory to track things over NOMMU without this patch (VMAs are no longer shared, and there are now region structs). Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org>	2009-01-08 12:04:47 +00:00
Linus Torvalds	8cfc7f9c00	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: fix possible recursive rq->lock	2009-01-07 15:43:58 -08:00
Linus Torvalds	b424e8d3b4	Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (98 commits) PCI PM: Put PM callbacks in the order of execution PCI PM: Run default PM callbacks for all devices using new framework PCI PM: Register power state of devices during initialization PCI PM: Call pci_fixup_device from legacy routines PCI PM: Rearrange code in pci-driver.c PCI PM: Avoid touching devices behind bridges in unknown state PCI PM: Move pci_has_legacy_pm_support PCI PM: Power-manage devices without drivers during suspend-resume PCI PM: Add suspend counterpart of pci_reenable_device PCI PM: Fix poweroff and restore callbacks PCI: Use msleep instead of cpu_relax during ASPM link retraining PCI: PCIe portdrv: Add kerneldoc comments to remining core funtions PCI: PCIe portdrv: Rearrange code so that related things are together PCI: PCIe portdrv: Fix suspend and resume of PCI Express port services PCI: PCIe portdrv: Add kerneldoc comments to some core functions x86/PCI: Do not use interrupt links for devices using MSI-X net: sfc: Use pci_clear_master() to disable bus mastering PCI: Add pci_clear_master() as opposite of pci_set_master() PCI hotplug: remove redundant test in cpq hotplug PCI: pciehp: cleanup register and field definitions ...	2009-01-07 15:41:01 -08:00
Linus Torvalds	67acd8b4b7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async * git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async: async: don't do the initcall stuff post boot bootchart: improve output based on Dave Jones' feedback async: make the final inode deletion an asynchronous event fastboot: Make libata initialization even more async fastboot: make the libata port scan asynchronous fastboot: make scsi probes asynchronous async: Asynchronous function calls to speed up kernel boot	2009-01-07 15:35:47 -08:00
Paul E. McKenney	c9d557c19f	rcu: fix bug in rcutorture system-shutdown code This patch fixes an rcutorture bug found by Eric Sesterhenn that resulted in oopses in response to "rmmod rcutorture". The problem was in some new code that attempted to handle the case where a system is shut down while rcutorture is still running, for example, when rcutorture is built into the kernel so that it cannot be removed. The fix causes the rcutorture threads to "park" in an schedule_timeout_uninterruptible(MAX_SCHEDULE_TIMEOUT) rather than trying to get them to terminate cleanly. Concurrent shutdown and rmmod is illegal. I believe that this is 2.6.29 material, as it is used in some testing setups. For reference, here are the rcutorture operating modes: CONFIG_RCU_TORTURE_TEST=m This is the normal rcutorture build. Use "modprobe rcutorture" (with optional arguments) to start, and "rmmod rcutorture" to stop. If you shut the system down without doing the rmmod, you should see console output like: rcutorture thread rcu_torture_writer parking due to system shutdown One for each rcutorture kthread. CONFIG_RCU_TORTURE_TEST=y CONFIG_RCU_TORTURE_TEST_RUNNABLE=n Use this if you want rcutorture built in, but don't want the test to start running during early boot. To start the torturing: echo 1 > /proc/sys/kernel/rcutorture_runnable To stop the torturing, s/1/0/ You will get "parking" console messages as noted above when you shut the system down. CONFIG_RCU_TORTURE_TEST=y CONFIG_RCU_TORTURE_TEST_RUNNABLE=y Same as above, except that the torturing starts during early boot. Only for the stout of heart and strong of stomach. The same /proc entry noted above may be used to control the test. Located-by: Eric Sesterhenn <snakebyte@gmx.de> Tested-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-07 23:36:25 +01:00
Robert Richter	465634adc1	ring_buffer: fix ring_buffer_event_length() Function ring_buffer_event_length() provides an interface to detect the length of data stored in an entry. However, the length contains offsets depending on the internal usage. This makes it unusable. This patch fixes this and now ring_buffer_event_length() returns the alligned length that has been used in ring_buffer_lock_reserve(). Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Robert Richter <robert.richter@amd.com>	2009-01-07 22:47:40 +01:00
Heiko Carstens	a0e280e0f3	stop_machine/cpu hotplug: fix disable_nonboot_cpus disable_nonboot_cpus calls _cpu_down. But _cpu_down requires that the caller already created the stop_machine workqueue (like cpu_down does). Otherwise a call to stop_machine will lead to accesses to random memory regions. When introducing this new interface (`9ea09af3bd` "stop_machine: introduce stop_machine_create/destroy") I missed the second call site of _cpu_down. So add the missing stop_machine_create/destroy calls to disable_nonboot_cpus as well. Fixes suspend-to-ram/disk and also this bug: [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b [ 286.548940] IP: [<c0150ca4>] __stop_machine+0x88/0xe3 [ 286.550598] Oops: 0002 [#1] SMP [ 286.560580] Pid: 3273, comm: halt Not tainted (2.6.28-06127-g238c6d5 [ 286.560580] EIP: is at __stop_machine+0x88/0xe3 [ 286.560580] Process halt (pid: 3273, ti=f1a28000 task=f4530f30 [ 286.560580] Call Trace: [ 286.560580] [<c03d04e4>] ? _cpu_down+0x10f/0x234 [ 286.560580] [<c012a57e>] ? disable_nonboot_cpus+0x58/0xdc [ 286.560580] [<c01360c0>] ? kernel_poweroff+0x22/0x39 [ 286.560580] [<c0136301>] ? sys_reboot+0xde/0x14c [ 286.560580] [<c01331b2>] ? complete_signal+0x179/0x191 [ 286.560580] [<c0133396>] ? send_signal+0x1cc/0x1e1 [ 286.560580] [<c03de418>] ? _spin_unlock_irqrestore+0x2d/0x3c [ 286.560580] [<c0133b65>] ? group_send_signal_info+0x58/0x61 [ 286.560580] [<c0133b9e>] ? kill_pid_info+0x30/0x3a [ 286.560580] [<c0133d49>] ? sys_kill+0x75/0x13a [ 286.560580] [<c01a06cb>] ? mntput_no_expire+ox1f/0x101 [ 286.560580] [<c019b3b3>] ? dput+0x1e/0x105 [ 286.560580] [<c018ef87>] ? __fput+0x150/0x158 [ 286.560580] [<c0157abf>] ? audit_syscall_entry+0x137/0x159 [ 286.560580] [<c010329f>] ? sysenter_do_call+0x12/0x34 Reported-and-tested-by: "Justin P. Mattock" <justinmattock@gmail.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-07 11:36:14 -08:00
Linus Torvalds	57c44c5f6f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits) trivial: chack -> check typo fix in main Makefile trivial: Add a space (and a comma) to a printk in 8250 driver trivial: Fix misspelling of "firmware" in docs for ncr53c8xx/sym53c8xx trivial: Fix misspelling of "firmware" in powerpc Makefile trivial: Fix misspelling of "firmware" in usb.c trivial: Fix misspelling of "firmware" in qla1280.c trivial: Fix misspelling of "firmware" in a100u2w.c trivial: Fix misspelling of "firmware" in megaraid.c trivial: Fix misspelling of "firmware" in ql4_mbx.c trivial: Fix misspelling of "firmware" in acpi_memhotplug.c trivial: Fix misspelling of "firmware" in ipw2100.c trivial: Fix misspelling of "firmware" in atmel.c trivial: Fix misspelled firmware in Kconfig trivial: fix an -> a typos in documentation and comments trivial: fix then -> than typos in comments and documentation trivial: update Jesper Juhl CREDITS entry with new email trivial: fix singal -> signal typo trivial: Fix incorrect use of "loose" in event.c trivial: printk: fix indentation of new_text_line declaration trivial: rtc-stk17ta8: fix sparse warning ...	2009-01-07 11:31:52 -08:00
Arjan van de Ven	e8de1481fd	resource: allow MMIO exclusivity for device drivers Device drivers that use pci_request_regions() (and similar APIs) have a reasonable expectation that they are the only ones accessing their device. As part of the e1000e hunt, we were afraid that some userland (X or some bootsplash stuff) was mapping the MMIO region that the driver thought it had exclusively via /dev/mem or via various sysfs resource mappings. This patch adds the option for device drivers to cause their reserved regions to the "banned from /dev/mem use" list, so now both kernel memory and device-exclusive MMIO regions are banned. NOTE: This is only active when CONFIG_STRICT_DEVMEM is set. In addition to the config option, a kernel parameter iomem=relaxed is provided for the cases where developers want to diagnose, in the field, drivers issues from userspace. Reviewed-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2009-01-07 11:12:32 -08:00
Peter Zijlstra	490dea45d0	itimers: remove the per-cpu-ish-ness Either we bounce once cacheline per cpu per tick, yielding n^2 bounces or we just bounce a single.. Also, using per-cpu allocations for the thread-groups complicates the per-cpu allocator in that its currently aimed to be a fixed sized allocator and the only possible extention to that would be vmap based, which is seriously constrained on 32 bit archs. So making the per-cpu memory requirement depend on the number of processes is an issue. Lastly, it didn't deal with cpu-hotplug, although admittedly that might be fixable. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-07 18:52:44 +01:00
Arjan van de Ven	ad160d2319	async: don't do the initcall stuff post boot while tracking the asynchronous calls during boot using the initcall_debug convention is useful, doing it once the kernel is done is actually bad now that we use asynchronous operations post boot as well... Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2009-01-07 09:31:49 -08:00
Arjan van de Ven	22a9d64567	async: Asynchronous function calls to speed up kernel boot Right now, most of the kernel boot is strictly synchronous, such that various hardware delays are done sequentially. In order to make the kernel boot faster, this patch introduces infrastructure to allow doing some of the initialization steps asynchronously, which will hide significant portions of the hardware delays in practice. In order to not change device order and other similar observables, this patch does NOT do full parallel initialization. Rather, it operates more in the way an out of order CPU does; the work may be done out of order and asynchronous, but the observable effects (instruction retiring for the CPU) are still done in the original sequence. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2009-01-07 08:45:46 -08:00
Peter Zijlstra	da8d5089da	sched: fix possible recursive rq->lock Vaidyanathan Srinivasan reported: > ============================================= > [ INFO: possible recursive locking detected ] > 2.6.28-autotest-tip-sv #1 > --------------------------------------------- > klogd/5062 is trying to acquire lock: > (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e > > but task is already holding lock: > (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31 With sched_mc at 2. (it is default-off) Strictly speaking we'll not deadlock, because ttwu will not be able to place the migration task on our rq, but since the code can deal with both rqs getting unlocked, this seems the easiest way out. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-01-07 16:10:54 +01:00
Linus Torvalds	c861ea2cb2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: CRED: Fix regression in cap_capable() as shown up by sys_faccessat() [ver #3] Revert "CRED: Fix regression in cap_capable() as shown up by sys_faccessat() [ver #2]" SELinux: shrink sizeof av_inhert selinux_class_perm and context CRED: Fix regression in cap_capable() as shown up by sys_faccessat() [ver #2] keys: fix sparse warning by adding __user annotation to cast smack: Add support for unlabeled network hosts and networks selinux: Deprecate and schedule the removal of the the compat_net functionality netlabel: Update kernel configuration API	2009-01-06 17:11:39 -08:00
Linus Torvalds	3610639d1f	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer: splitout peek ahead functionality, fix hrtimer: fixup comments hrtimer: fix recursion deadlock by re-introducing the softirq hrtimer: simplify hotplug migration hrtimer: fix HOTPLUG_CPU=n compile warning hrtimer: splitout peek ahead functionality	2009-01-06 17:10:53 -08:00
Linus Torvalds	cfa97f993c	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: fix section mismatch sched: fix double kfree in failure path sched: clean up arch_reinit_sched_domains() sched: mark sched_create_sysfs_power_savings_entries() as __init getrusage: RUSAGE_THREAD should return ru_utime and ru_stime sched: fix sched_slice() sched_clock: prevent scd->clock from moving backwards, take #2 sched: sched.c declare variables before they get used	2009-01-06 17:10:33 -08:00
Linus Torvalds	f94181da71	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: fix rcutorture bug rcu: eliminate synchronize_rcu_xxx macro rcu: make treercu safe for suspend and resume rcu: fix rcutree grace-period-latency bug on small systems futex: catch certain assymetric (get\|put)_futex_key calls futex: make futex_(get\|put)_key() calls symmetric locking, percpu counters: introduce separate lock classes swiotlb: clean up EXPORT_SYMBOL usage swiotlb: remove unnecessary declaration swiotlb: replace architecture-specific swiotlb.h with linux/swiotlb.h swiotlb: add support for systems with highmem swiotlb: store phys address in io_tlb_orig_addr array swiotlb: add hwdev to swiotlb_phys_to_bus() / swiotlb_sg_to_bus()	2009-01-06 17:10:04 -08:00
Linus Torvalds	40d7ee5d16	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (60 commits) uio: make uio_info's name and version const UIO: Documentation for UIO ioport info handling UIO: Pass information about ioports to userspace (V2) UIO: uio_pdrv_genirq: allow custom irq_flags UIO: use pci_ioremap_bar() in drivers/uio arm: struct device - replace bus_id with dev_name(), dev_set_name() libata: struct device - replace bus_id with dev_name(), dev_set_name() avr: struct device - replace bus_id with dev_name(), dev_set_name() block: struct device - replace bus_id with dev_name(), dev_set_name() chris: struct device - replace bus_id with dev_name(), dev_set_name() dmi: struct device - replace bus_id with dev_name(), dev_set_name() gadget: struct device - replace bus_id with dev_name(), dev_set_name() gpio: struct device - replace bus_id with dev_name(), dev_set_name() gpu: struct device - replace bus_id with dev_name(), dev_set_name() hwmon: struct device - replace bus_id with dev_name(), dev_set_name() i2o: struct device - replace bus_id with dev_name(), dev_set_name() IA64: struct device - replace bus_id with dev_name(), dev_set_name() i7300_idle: struct device - replace bus_id with dev_name(), dev_set_name() infiniband: struct device - replace bus_id with dev_name(), dev_set_name() ISDN: struct device - replace bus_id with dev_name(), dev_set_name() ...	2009-01-06 17:02:07 -08:00
Johannes Weiner	58c6d3dfe4	dma-coherent: catch oversized requests to dma_alloc_from_coherent() Prevent passing an order to bitmap_find_free_region() that is larger than the actual bitmap can represent. These requests can come from device drivers that have no idea how big the dma region is and need to rely on dma_alloc_from_coherent() to sort it out for them. Reported-by: Guennadi Liakhovetski <lg@denx.de> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Dmitry Baryshkov <dbaryshkov@gmail.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:31 -08:00
Andrew Morton	eccd83e116	dma_alloc_coherent: clean it up This thing was rather stupidly coded. Rework it all prior to making changes. Also, rename local variable `page': kernel readers expect something called `page' to have type `struct page *'. Cc: Guennadi Liakhovetski <lg@denx.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Dmitry Baryshkov <dbaryshkov@gmail.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:31 -08:00
Andrew Morton	0bef3c2dc7	dma_alloc_from_coherent(): fix fallback to generic memory If bitmap_find_free_region() fails and DMA_MEMORY_EXCLUSIVE is not set, the function will fail to write anything to *ret and will return 1. This will cause dma_alloc_coherent() to return an uninitialised value, crashing the kernel, perhaps via DMA to a random address. Fix that by changing it to return zero in this case, so the caller will proceed to allocate the memory from the generic memory allocator. Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Dmitry Baryshkov <dbaryshkov@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:30 -08:00
Hidehiro Kawai	4cb0e11b15	coredump_filter: permit changing of the default filter Introduce a new kernel parameter `coredump_filter'. Setting a value to this parameter causes the default bitmask of coredump_filter to be changed. It is useful for users to change coredump_filter settings for the whole system at boot time. Without this parameter, users have to change coredump_filter settings for each /proc/<pid>/ in an initializing script. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Roland McGrath <roland@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:29 -08:00
Sukadev Bhattiprolu	9cd4fd1043	SEND_SIG_NOINFO: set si_pid to tgid instead of pid POSIX requires the si_pid to be the process id of the sender, so ->si_pid should really be set to 'tgid'. This change does have following changes in behavior: - When sending pdeath_signal on re-parent to a sub-thread, ->si_pid cannot be used to identify the thread that did the re-parent since it will now show the tgid instead of thread id. - A multi-threaded application that expects to find the specific thread that encountered a SIGPIPE using the ->si_pid will now break. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-By: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:29 -08:00
Sukadev Bhattiprolu	09bca05c90	SEND_SIG_NOINFO: masquerade si_pid when crossing pid-ns boundary For SEND_SIG_NOINFO, si_pid is currently set to the pid of sender in sender's active pid namespace. But if the receiver is in a Eg: when parent sends the 'pdeath_signal' to a child that is in a descendant pid namespace, we should set si_pid 0. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-By: Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:28 -08:00
Randy Dunlap	bd4207c901	kmod: fix varargs kernel-doc Fix varargs kernel-doc format in kmod.c: Use @... instead of @varargs. Warning(kernel/kmod.c:67): Excess function parameter or struct member 'varargs' description in 'request_module' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:27 -08:00
Masami Hiramatsu	f24659d96f	kprobes: support probing module __init function Allow kprobes to probe module __init routines. When __init functions are freed, kprobes which probe those functions are set to "Gone" flag. These "Gone" probes are disarmed from the code and never be enabled. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:21 -08:00
Masami Hiramatsu	0deddf436a	module: add MODULE_STATE_LIVE notify Add a module notifier call which notifies that the state of a module changes from MODULE_STATE_COMING to MODULE_STATE_LIVE. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:21 -08:00
Masami Hiramatsu	49ad2fd76c	kprobes: remove called_from argument Remove called_from argument from kprobes which had been used for preventing self-refering of kernel module. However, since we don't keep module's refcount after registering kprobe any more, there is no reason to check that. This patch also simplifies registering/unregistering functions because we don't need to use __builtin_return_address(0) which was passed to called_from. [ananth@in.ibm.com: build fix] Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:21 -08:00
Masami Hiramatsu	e8386a0cb2	kprobes: support probing module __exit function Allows kprobes to probe __exit routine. This adds flags member to struct kprobe. When module is freed(kprobes hooks module_notifier to get this event), kprobes which probe the functions in that module are set to "Gone" flag to the flags member. These "Gone" probes are never be enabled. Users can check the GONE flag through debugfs. This also removes mod_refcounted, because we couldn't free a module if kprobe incremented the refcount of that module. [akpm@linux-foundation.org: document some locking] [mhiramat@redhat.com: bugfix: pass aggr_kprobe to arch_remove_kprobe] [mhiramat@redhat.com: bugfix: release old_p's insn_slot before error return] Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:21 -08:00
Masami Hiramatsu	017c39bdb1	kprobes: add __kprobes to kprobe internal functions Add __kprobes to kprobes internal functions for protecting from probing by kprobes itself. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:20 -08:00
Masami Hiramatsu	1294156078	kprobes: add kprobe_insn_mutex and cleanup arch_remove_kprobe() Add kprobe_insn_mutex for protecting kprobe_insn_pages hlist, and remove kprobe_mutex from architecture dependent code. This allows us to call arch_remove_kprobe() (and free_insn_slot) while holding kprobe_mutex. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:20 -08:00
Masami Hiramatsu	a06f6211ef	module: add within_module_core() and within_module_init() This series of patches allows kprobes to probe module's __init and __exit functions. This means, you can probe driver initialization and terminating. Currently, kprobes can't probe __init function because these functions are freed after module initialization. And it also can't probe module __exit functions because kprobe increments reference count of target module and user can't unload it. this means __exit functions never be called unless removing probes from the module. To solve both cases, this series of patches introduces GONE flag and sets it when the target code is freed(for this purpose, kprobes hooks MODULE_STATE_* events). This also removes refcount incrementing for allowing user to unload target module. Users can check which probes are GONE by debugfs interface. For taking timing of freeing module's .init text, these also include a patch which adds module's notifier of MODULE_STATE_LIVE event. This patch: Add within_module_core() and within_module_init() for checking whether an address is in the module .init.text section or .text section, and replace within() local inline functions in kernel/module.c with them. kprobes uses these functions to check where the kprobe is inserted. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:20 -08:00
Masami Hiramatsu	12da3b888b	kprobes: add tests for register_kprobes Add testcases for *probe batch registration (register_kprobes) to kprobes sanity tests. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: David Miller <davem@davemloft.net> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:20 -08:00
Masami Hiramatsu	8e1144050e	kprobes: indirectly call kprobe_target Call kprobe_target indirectly. This prevents gcc to unroll a noinline function in caller function. I ported patches which had been discussed on http://sources.redhat.com/bugzilla/show_bug.cgi?id=3542 Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: David Miller <davem@davemloft.net> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:20 -08:00
Masami Hiramatsu	bc2f70151f	kprobes: bugfix: try_module_get even if calling_mod is NULL When someone called register_probe() from kernel-core code(not from module) and that probes a kernel module, users can remove the probed module because kprobe doesn't increment reference counter of the module. (on the other hand, if the kernel-module calls register_probe, kprobe increments refcount of the probed module.) Currently, we have no register_probe() calling from kernel-core(except smoke-test, but the smoke-test doesn't probe module), so there is no real bugs. But the logic is wrong(or not fair) and it can causes a problem when someone might want to probe module from kernel. After this patch is applied, even if someone put register_probe() call in the kernel-core code, it increments the reference counter of the probed module, and it prevents user to remove the module until stopping probing it. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:20 -08:00
KOSAKI Motohiro	26e5438e4b	profile: don't include <asm/ptrace.h> twice. Currently, kernel/profile.c include <asm/ptrace.h> twice. It can be removed. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:14 -08:00
Paul Mackerras	e3d5a27d58	Allow times and time system calls to return small negative values At the moment, the times() system call will appear to fail for a period shortly after boot, while the value it want to return is between -4095 and -1. The same thing will also happen for the time() system call on 32-bit platforms some time in 2106 or so. On some platforms, such as x86, this is unavoidable because of the system call ABI, but other platforms such as powerpc have a separate error indication from the return value, so system calls can in fact return small negative values without indicating an error. On those platforms, force_successful_syscall_return() provides a way to indicate that the system call return value should not be treated as an error even if it is in the range which would normally be taken as a negative error number. This adds a force_successful_syscall_return() call to the time() and times() system calls plus their 32-bit compat versions, so that they don't erroneously indicate an error on those platforms whose system call ABI has a separate error indication. This will not affect anything on other platforms. Joakim Tjernlund added the fix for time() and the compat versions of time() and times(), after I did the fix for times(). Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:13 -08:00
Arjan van de Ven	d6624f996a	oops: increment the oops UUID every time we oops ... because we do want repeated same-oops to be seen by automated tools like kerneloops.org Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:12 -08:00
Zhaolei	60348802e9	fork.c: cleanup for copy_sighand() Check CLONE_SIGHAND only is enough, because combination of CLONE_THREAD and CLONE_SIGHAND is already done in copy_process(). Impact: cleanup, no functionality changed Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:11 -08:00
Alexey Dobriyan	f1883f86de	Remove remaining unwinder code Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Gabor Gombas <gombasg@sztaki.hu> Cc: Jan Beulich <jbeulich@novell.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu>, Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:11 -08:00
Oleg Nesterov	901608d904	mm: introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting xacct_add_tsk() relies on do_exit()->update_hiwater_xxx() and uses mm->hiwater_xxx directly, this leads to 2 problems: - taskstats_user_cmd() can call fill_pid()->xacct_add_tsk() at any moment before the task exits, so we should check the current values of rss/vm anyway. - do_exit()->update_hiwater_xxx() calls are racy. An exiting thread can be preempted right before mm->hiwater_xxx = new_val, and another thread can use A_LOT of memory and exit in between. When the first thread resumes it can be the last thread in the thread group, in that case we report the wrong hiwater_xxx values which do not take A_LOT into account. Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() helpers and change xacct_add_tsk() to use them. The first helper will also be used by rusage->ru_maxrss accounting. Kill do_exit()->update_hiwater_xxx() calls. Unless we are going to decrease rss/vm there is no point to update mm->hiwater_xxx, and nobody can look at this mm_struct when exit_mmap() actually unmaps the memory. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:09 -08:00
David Rientjes	2da02997e0	mm: add dirty_background_bytes and dirty_bytes sysctls This change introduces two new sysctls to /proc/sys/vm: dirty_background_bytes and dirty_bytes. dirty_background_bytes is the counterpart to dirty_background_ratio and dirty_bytes is the counterpart to dirty_ratio. With growing memory capacities of individual machines, it's no longer sufficient to specify dirty thresholds as a percentage of the amount of dirtyable memory over the entire system. dirty_background_bytes and dirty_bytes specify quantities of memory, in bytes, that represent the dirty limits for the entire system. If either of these values is set, its value represents the amount of dirty memory that is needed to commence either background or direct writeback. When a `bytes' or `ratio' file is written, its counterpart becomes a function of the written value. For example, if dirty_bytes is written to be 8096, 8K of memory is required to commence direct writeback. dirty_ratio is then functionally equivalent to 8K / the amount of dirtyable memory: dirtyable_memory = free pages + mapped pages + file cache dirty_background_bytes = dirty_background_ratio * dirtyable_memory -or- dirty_background_ratio = dirty_background_bytes / dirtyable_memory AND dirty_bytes = dirty_ratio * dirtyable_memory -or- dirty_ratio = dirty_bytes / dirtyable_memory Only one of dirty_background_bytes and dirty_background_ratio may be specified at a time, and only one of dirty_bytes and dirty_ratio may be specified. When one sysctl is written, the other appears as 0 when read. The `bytes' files operate on a page size granularity since dirty limits are compared with ZVC values, which are in page units. Prior to this change, the minimum dirty_ratio was 5 as implemented by get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user written value between 0 and 100. This restriction is maintained, but dirty_bytes has a lower limit of only one page. Also prior to this change, the dirty_background_ratio could not equal or exceed dirty_ratio. This restriction is maintained in addition to restricting dirty_background_bytes. If either background threshold equals or exceeds that of the dirty threshold, it is implicitly set to half the dirty threshold. Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:03 -08:00
Hugh Dickins	e5991371ee	mm: remove cgroup_mm_owner_callbacks cgroup_mm_owner_callbacks() was brought in to support the memrlimit controller, but sneaked into mainline ahead of it. That controller has now been shelved, and the mm_owner_changed() args were inadequate for it anyway (they needed an mm pointer instead of a task pointer). Remove the dead code, and restore mm_update_next_owner() locking to how it was before: taking mmap_sem there does nothing for memcontrol.c, now the only user of mm->owner. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:59:01 -08:00
David Rientjes	75aa199410	oom: print triggering task's cpuset and mems allowed When cpusets are enabled, it's necessary to print the triggering task's set of allowable nodes so the subsequently printed meminfo can be interpreted correctly. We also print the task's cpuset name for informational purposes. [rientjes@google.com: task lock current before dereferencing cpuset] Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-01-06 15:58:59 -08:00
James Morris	ac8cc0fa53	Merge branch 'next' into for-linus	2009-01-07 09:58:22 +11:00
David Howells	3699c53c48	CRED: Fix regression in cap_capable() as shown up by sys_faccessat() [ver #3 ] Fix a regression in cap_capable() due to: commit `3b11a1dece` Author: David Howells <dhowells@redhat.com> Date: Fri Nov 14 10:39:26 2008 +1100 CRED: Differentiate objective and effective subjective credentials on a task The problem is that the above patch allows a process to have two sets of credentials, and for the most part uses the subjective credentials when accessing current's creds. There is, however, one exception: cap_capable(), and thus capable(), uses the real/objective credentials of the target task, whether or not it is the current task. Ordinarily this doesn't matter, since usually the two cred pointers in current point to the same set of creds. However, sys_faccessat() makes use of this facility to override the credentials of the calling process to make its test, without affecting the creds as seen from other processes. One of the things sys_faccessat() does is to make an adjustment to the effective capabilities mask, which cap_capable(), as it stands, then ignores. The affected capability check is in generic_permission(): if (!(mask & MAY_EXEC) \|\| execute_ok(inode)) if (capable(CAP_DAC_OVERRIDE)) return 0; This change passes the set of credentials to be tested down into the commoncap and SELinux code. The security functions called by capable() and has_capability() select the appropriate set of credentials from the process being checked. This can be tested by compiling the following program from the XFS testsuite: /* * t_access_root.c - trivial test program to show permission bug. * * Written by Michael Kerrisk - copyright ownership not pursued. * Sourced from: http://linux.derkeiler.com/Mailing-Lists/Kernel/2003-10/6030.html / #include <limits.h> #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <sys/stat.h> #define UID 500 #define GID 100 #define PERM 0 #define TESTPATH "/tmp/t_access" static void errExit(char msg) { perror(msg); exit(EXIT_FAILURE); } /* errExit / static void accessTest(char file, int mask, char mstr) { printf("access(%s, %s) returns %d\n", file, mstr, access(file, mask)); } / accessTest / int main(int argc, char argv[]) { int fd, perm, uid, gid; char testpath; char cmd[PATH_MAX + 20]; testpath = (argc > 1) ? argv[1] : TESTPATH; perm = (argc > 2) ? strtoul(argv[2], NULL, 8) : PERM; uid = (argc > 3) ? atoi(argv[3]) : UID; gid = (argc > 4) ? atoi(argv[4]) : GID; unlink(testpath); fd = open(testpath, O_RDWR \| O_CREAT, 0); if (fd == -1) errExit("open"); if (fchown(fd, uid, gid) == -1) errExit("fchown"); if (fchmod(fd, perm) == -1) errExit("fchmod"); close(fd); snprintf(cmd, sizeof(cmd), "ls -l %s", testpath); system(cmd); if (seteuid(uid) == -1) errExit("seteuid"); accessTest(testpath, 0, "0"); accessTest(testpath, R_OK, "R_OK"); accessTest(testpath, W_OK, "W_OK"); accessTest(testpath, X_OK, "X_OK"); accessTest(testpath, R_OK \| W_OK, "R_OK \| W_OK"); accessTest(testpath, R_OK \| X_OK, "R_OK \| X_OK"); accessTest(testpath, W_OK \| X_OK, "W_OK \| X_OK"); accessTest(testpath, R_OK \| W_OK \| X_OK, "R_OK \| W_OK \| X_OK"); exit(EXIT_SUCCESS); } / main */ This can be run against an Ext3 filesystem as well as against an XFS filesystem. If successful, it will show: [root@andromeda src]# ./t_access_root /tmp/xxx 0 4043 4043 ---------- 1 dhowells dhowells 0 2008-12-31 03:00 /tmp/xxx access(/tmp/xxx, 0) returns 0 access(/tmp/xxx, R_OK) returns 0 access(/tmp/xxx, W_OK) returns 0 access(/tmp/xxx, X_OK) returns -1 access(/tmp/xxx, R_OK \| W_OK) returns 0 access(/tmp/xxx, R_OK \| X_OK) returns -1 access(/tmp/xxx, W_OK \| X_OK) returns -1 access(/tmp/xxx, R_OK \| W_OK \| X_OK) returns -1 If unsuccessful, it will show: [root@andromeda src]# ./t_access_root /tmp/xxx 0 4043 4043 ---------- 1 dhowells dhowells 0 2008-12-31 02:56 /tmp/xxx access(/tmp/xxx, 0) returns 0 access(/tmp/xxx, R_OK) returns -1 access(/tmp/xxx, W_OK) returns -1 access(/tmp/xxx, X_OK) returns -1 access(/tmp/xxx, R_OK \| W_OK) returns -1 access(/tmp/xxx, R_OK \| X_OK) returns -1 access(/tmp/xxx, W_OK \| X_OK) returns -1 access(/tmp/xxx, R_OK \| W_OK \| X_OK) returns -1 I've also tested the fix with the SELinux and syscalls LTP testsuites. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: J. Bruce Fields <bfields@citi.umich.edu> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-01-07 09:38:48 +11:00
James Morris	29881c4502	Revert "CRED: Fix regression in cap_capable() as shown up by sys_faccessat() [ver #2 ]" This reverts commit `14eaddc967`. David has a better version to come.	2009-01-07 09:21:54 +11:00

1 2 3 4 5 ...

5982 Commits