linux

Author	SHA1	Message	Date
Rusty Russell	9b473de872	param: Fix duplicate module prefixes Instead of insisting each new module_param sysfs entry is unique, handle the case where it already exists (for builtin modules). The current code assumes that all identical prefixes are together in the section: true for normal uses, but not necessarily so if someone overrides MODULE_PARAM_PREFIX. More importantly, it's not true with the new "core_param()" code which uses "kernel" as a prefix. This simplifies the caller for the builtin case, at a slight loss of efficiency (we do the lookup every time to see if the directory exists). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-22 10:00:23 +11:00
Rusty Russell	730b69d225	module: check kernel param length at compile time, not runtime The kparam code tries to handle over-length parameter prefixes at runtime. Not only would I bet this has never been tested, it's not clear that truncating names is a good idea either. So let's check at compile time. We need to move the #define to moduleparam.h to do this, though. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-22 10:00:22 +11:00
Andi Kleen	d72b37513c	Remove stop_machine during module load v2 Module loading currently does a stop_machine on each module load to insert the module into the global module lists. Especially on larger systems this can be quite expensive. It does that to handle concurrent lockless module list readers like kallsyms. I don't think stop_machine() is actually needed to insert something into a list though. There are no concurrent writers because the module mutex is taken. And the RCU list functions know how to insert a node into a list with the right memory ordering so that concurrent readers don't go off into the wood. So remove the stop_machine for the module list insert and just do a list_add_rcu() instead. Module removal will still do a stop_machine of course, it needs that for other reasons. v2: Revised readers based on Paul's comments. All readers that only rely on disabled preemption need to be changed to list_for_each_rcu(). Done that. The others are ok because they have the modules mutex. Also added a possible missing preempt disable for print_modules(). [cc Paul McKenney for review. It's not RCU, but quite similar.] Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-22 10:00:22 +11:00
Rusty Russell	5e458cc0f4	module: simplify load_module. Linus' recent catch of stack overflow in load_module led me to look at the code. A couple of helpers to get a section address and get objects from a section can help clean things up a little. (And in case you're wondering, the stack size also dropped from 328 to 284 bytes). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-22 10:00:15 +11:00
Thomas Gleixner	c4bd822e7b	NOHZ: fix thinko in the timer restart code path commit `fb02fbc14d` (NOHZ: restart tick device from irq_enter()) solves the problem of stale jiffies when long running softirqs happen in a long idle sleep period, but it has a major thinko in it: When the interrupt which came in _is_ the timer interrupt which should expire ts->sched_timer then we cancel and rearm the timer _before_ it gets expired in hrtimer_interrupt() to the next period. That means the call back function is not called. This game can go on for ever :( Prevent this by making sure to only rearm the timer when the expiry time is more than one tick_period away. Otherwise keep it running as it is either already expired or will expire at the right point to update jiffies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Venkatesch Pallipadi <venkatesh.pallipadi@intel.com>	2008-10-21 20:53:24 +02:00
Lai Jiangshan	5f86515158	rcupdate: fix bug of rcu_barrier() The current rcu_barrier_bh() looks like this: void rcu_barrier_bh(void) { BUG_ON(in_interrupt()); /* Take cpucontrol mutex to protect against CPU hotplug */ mutex_lock(&rcu_barrier_mutex); init_completion(&rcu_barrier_completion); atomic_set(&rcu_barrier_cpu_count, 0); /* * The queueing of callbacks in all CPUs must be atomic with * respect to RCU, otherwise one CPU may queue a callback, * wait for a grace period, decrement barrier count and call * complete(), while other CPUs have not yet queued anything. * So, we need to make sure that grace periods cannot complete * until all the callbacks are queued. */ rcu_read_lock(); on_each_cpu(rcu_barrier_func, (void *)RCU_BARRIER_BH, 1); rcu_read_unlock(); wait_for_completion(&rcu_barrier_completion); mutex_unlock(&rcu_barrier_mutex); } The inconsistency between the code and the comment reveals a bug here. rcu_read_lock() cannot make sure that "grace periods for RCU_BH cannot complete until all the callbacks are queued"; it only makes sure that grace periods for RCU cannot complete until all the callbacks are queued. So we must use rcu_read_lock_bh() for rcu_barrier_bh(), like this: void rcu_barrier_bh(void) { ...... rcu_read_lock_bh(); on_each_cpu(rcu_barrier_func, (void *)RCU_BARRIER_BH, 1); rcu_read_unlock_bh(); ...... } If rcu_barrier() and rcu_barrier_sched() were also implemented like this, it would bring a lot of duplicate code. My patch uses another way to fix this bug; please see the comment in my patch. Thanks to Paul E. McKenney for rewriting the comment. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-21 15:59:53 +02:00
Dean Nelson	b6f3b7803a	genirq: NULL struct irq_desc's member 'name' in dynamic_irq_cleanup() If the member 'name' of the irq_desc structure happens to point to a character string that is resident within a kernel module, problems ensue if that module is rmmod'd (at which time dynamic_irq_cleanup() is called) and then later show_interrupts() is called by someone. It is also not a good thing if the character string resided in kmalloc'd space that has been kfree'd (after having called dynamic_irq_cleanup()). dynamic_irq_cleanup() fails to NULL the 'name' member and show_interrupts() references it on a few architectures (like h8300, sh and x86). Signed-off-by: Dean Nelson <dcn@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-21 15:59:21 +02:00
Al Viro	572c489215	[PATCH] sanitize blkdev_get() and friends * get rid of fake struct file/struct dentry in __blkdev_get() * merge __blkdev_get() and do_open() * get rid of flags argument of blkdev_get() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:49:06 -04:00
Al Viro	c2dd0dae18	[PATCH] propagate mode through swsusp_close() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:49:02 -04:00
Al Viro	9a1c354276	[PATCH] pass fmode_t to blkdev_put() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:48:58 -04:00
Chris Friesen	0b3682ba33	genirq: fix set_irq_type() when recording trigger type Impact: fix boot hang on a G5 In set_irq_type() we want to pass the type rather than the current interrupt state. Signed-off-by: Chris Friesen <cfriesen@nortel.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-21 10:10:08 +02:00
Luck, Tony	5f41b8cdc6	kexec: fix crash_save_vmcoreinfo_init build problem This fixes kernel/kexec.c: In function 'crash_save_vmcoreinfo_init': kernel/kexec.c:1374: error: 'vmlist' undeclared (first use in this function) kernel/kexec.c:1374: error: (Each undeclared identifier is reported only once kernel/kexec.c:1374: error: for each function it appears in.) kernel/kexec.c:1410: error: invalid use of undefined type 'struct vm_struct' make[1]: *** [kernel/kexec.o] Error 1 Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 15:28:50 -07:00
Linus Torvalds	92b29b86fe	Merge branch 'tracing-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (131 commits) tracing/fastboot: improve help text tracing/stacktrace: improve help text tracing/fastboot: fix initcalls disposition in bootgraph.pl tracing/fastboot: fix bootgraph.pl initcall name regexp tracing/fastboot: fix issues and improve output of bootgraph.pl tracepoints: synchronize unregister static inline tracepoints: tracepoint_synchronize_unregister() ftrace: make ftrace_test_p6nop disassembler-friendly markers: fix synchronize marker unregister static inline tracing/fastboot: add better resolution to initcall debug/tracing trace: add build-time check to avoid overrunning hex buffer ftrace: fix hex output mode of ftrace tracing/fastboot: fix initcalls disposition in bootgraph.pl tracing/fastboot: fix printk format typo in boot tracer ftrace: return an error when setting a nonexistent tracer ftrace: make some tracers reentrant ring-buffer: make reentrant ring-buffer: move page indexes into page headers tracing/fastboot: only trace non-module initcalls ftrace: move pc counter in irqtrace ... Manually fix conflicts: - init/main.c: initcall tracing - kernel/module.c: verbose level vs tracepoints - scripts/bootgraph.pl: fallout from cherry-picking commits.	2008-10-20 13:35:07 -07:00
Linus Torvalds	9301975ec2	Merge branch 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip This merges branches irq/genirq, irq/sparseirq-v4, timers/hpet-percpu and x86/uv. The sparseirq branch is just preliminary groundwork: no sparse IRQs are actually implemented by this tree anymore - just the new APIs are added while keeping the old way intact as well (the new APIs map 1:1 to irq_desc[]). The 'real' sparse IRQ support will then be a relatively small patch ontop of this - with a v2.6.29 merge target. * 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (178 commits) genirq: improve include files intr_remapping: fix typo io_apic: make irq_mis_count available on 64-bit too genirq: fix name space collisions of nr_irqs in arch/* genirq: fix name space collision of nr_irqs in autoprobe.c genirq: use iterators for irq_desc loops proc: fixup irq iterator genirq: add reverse iterator for irq_desc x86: move ack_bad_irq() to irq.c x86: unify show_interrupts() and proc helpers x86: cleanup show_interrupts genirq: cleanup the sparseirq modifications genirq: remove artifacts from sparseirq removal genirq: revert dynarray genirq: remove irq_to_desc_alloc genirq: remove sparse irq code genirq: use inline function for irq_to_desc genirq: consolidate nr_irqs and for_each_irq_desc() x86: remove sparse irq from Kconfig genirq: define nr_irqs for architectures with GENERIC_HARDIRQS=n ...	2008-10-20 13:23:01 -07:00
Linus Torvalds	99ebcf8285	Merge branch 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits) fix documentation of sysrq-q really Fix documentation of sysrq-q timer_list: add base address to clock base timer_list: print cpu number of clockevents device timer_list: print real timer address NOHZ: restart tick device from irq_enter() NOHZ: split tick_nohz_restart_sched_tick() NOHZ: unify the nohz function calls in irq_enter() timers: fix itimer/many thread hang, fix timers: fix itimer/many thread hang, v3 ntp: improve adjtimex frequency rounding timekeeping: fix rounding problem during clock update ntp: let update_persistent_clock() sleep hrtimer: reorder struct hrtimer to save 8 bytes on 64bit builds posix-timers: lock_timer: make it readable posix-timers: lock_timer: kill the bogus ->it_id check posix-timers: kill ->it_sigev_signo and ->it_sigev_value posix-timers: sys_timer_create: cleanup the error handling posix-timers: move the initialization of timer->sigq from send to create path posix-timers: sys_timer_create: simplify and s/tasklist/rcu/ ... Fix trivial conflicts due to sysrq-q description clashes in Documentation/sysrq.txt and drivers/char/sysrq.c	2008-10-20 13:19:56 -07:00
Harvey Harrison	f07767fd0f	byteorder: remove direct includes of linux/byteorder/swab[b].h A consolidated implementation will provide this generically through asm/byteorder, remove direct includes to avoid breakage when the changeover to the new implementation occurs. This hunk was lost from commit `1d8cca44b6` ("byteorder: provide swabb.h generically in asm/byteorder.h") Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 12:51:53 -07:00
Steven Rostedt	81520a1b06	ftrace: stack tracer only record when on stack The stack trace API does not record if the stack is not on the current task's stack. That is, if the stack is the interrupt stack or NMI stack, the output does not show. Also, the size of those stacks is not consistent with the size of the thread stack, which usually makes the calculation of the stack size bogus. This all confuses the stack tracer. I unfortunately do not have time to fix all these problems, but this patch does record the worst stack when the stack pointer is on the task's stack (instead of bogus numbers). The patch simply returns if the stack pointer is not on the task's stack. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 18:31:37 +02:00
Steven Rostedt	3ce83aea86	ftrace: rename the ftrace tracer to function To avoid further confusion between the ftrace infrastructure and the function tracer. This patch renames the "ftrace" function tracer to "function". Now in available_tracers, instead of "ftrace" there will be "function". This makes more sense, since people will not know exactly what the "ftrace" tracer does. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 18:27:04 +02:00
Steven Rostedt	606576ce81	ftrace: rename FTRACE to FUNCTION_TRACER Due to confusion between the ftrace infrastructure and the gcc profiling tracer "ftrace", this patch renames the config options from FTRACE to FUNCTION_TRACER. The other two names that are offspring from FTRACE DYNAMIC_FTRACE and FTRACE_MCOUNT_RECORD will stay the same. This patch was generated mostly by script, and partially by hand. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 18:27:03 +02:00
Steven Rostedt	c2db8054c1	ftrace: fix depends A lot of tracers have HAVE_FTRACE as a dependent config where it really should not. The HAVE_FTRACE is a misnomer (soon to be fixed) and describes if the architecture has the function tracer (mcount) implemented. The ftrace infrastructure is implemented in all archs. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 18:27:02 +02:00
Steven Rostedt	bd95b88d9e	ftrace: release functions from hash The x86 architecture uses a static recording of mcount caller locations and is not affected by this patch. For architectures still using the dynamic ftrace daemon, this patch is critical. It removes the race between the recording of a function that calls mcount, the unloading of a module, and the ftrace daemon updating the call sites. This patch adds the releasing of the hash functions that the daemon uses to update the mcount call sites. When a module is unloaded, not only is the replaced call site table updated, but now so are the hash-recorded functions that the ftrace daemon will use. Again, architectures that implement MCOUNT_RECORD are not affected by this (which currently only x86 has). Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 18:27:01 +02:00
Harvey Harrison	1a651a00e2	byteorder: remove direct includes of linux/byteorder/swab[b].h A consolidated implementation will provide this generically through asm/byteorder, remove direct includes to avoid breakage when the changeover to the new implementation occurs. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:40 -07:00
Ken'ichi Ohmichi	acd99dbf54	kdump: add vmlist.addr to vmcoreinfo for x86 vmalloc translation. Add the symbols 'vmlist' and offset 'vm_struct.addr' to the vmcoreinfo[1] data for i386 vmalloc translation. makedumpfile[2] needs VMALLOC_START value for distinguishing a vmalloc address or not, because it should choose suitable translation method. If applying this patch, makedumpfile will be able to take VMALLOC_START value from 'vmlist.addr'. vmcoreinfo[1]: The vmcoreinfo data has the minimum debugging information only for dump filtering. makedumpfile[2] uses it to distinguish unnecessary pages and creates a small dumpfile. makedumpfile[2]: dump filtering command https://sourceforge.net/projects/makedumpfile/ Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:40 -07:00
Oleg Nesterov	293adee601	kthread_bind: use wait_task_inactive(TASK_UNINTERRUPTIBLE) Now that wait_task_inactive(task, state) checks task->state == state, we can simplify the code and make this debugging check more robust. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:39 -07:00
Adrian Bunk	b747c8c102	make ptrace_untrace() static ptrace_untrace() can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:39 -07:00
Lai Jiangshan	30e8e13603	cpuset: use seq_mask_ to print masks 1) seq_file expects that m->count == m->size when its buf is full, so the current code will cause bugs when buf overflows. 2) It is not good for cpuset to access struct seq_file's fields directly. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:39 -07:00
Rakib Mullick	40b6a76237	cpuset.c: remove extra variable Remove the use of int cpus_nonempty variable from 'update_flag' function. Signed-off-by: Md.Rakib H. Mullick <rakib.mullick@gmail.com> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:39 -07:00
Paul Menage	cc31edceee	cgroups: convert tasks file to use a seq_file with shared pid array Rather than pre-generating the entire text for the "tasks" file each time the file is opened, we instead just generate/update the array of process ids and use a seq_file to report these to userspace. All open file handles on the same "tasks" file can share a pid array, which may be updated any time that no thread is actively reading the array. By sharing the array, the potential for userspace to DoS the system by opening many handles on the same "tasks" file is removed. [Based on a patch by Lai Jiangshan, extended to use seq_file] Signed-off-by: Paul Menage <menage@google.com> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:38 -07:00
Lai Jiangshan	146aa1bd05	cgroups: fix probable race with put_css_set[_taskexit] and find_css_set put_css_set_taskexit may be called when find_css_set is called on another CPU. And the race will occur: put_css_set_taskexit side find_css_set side | atomic_dec_and_test(&kref->refcount) | /* kref->refcount = 0 */ | .................................................................... | read_lock(&css_set_lock) | find_existing_css_set | get_css_set | read_unlock(&css_set_lock); .................................................................... __release_css_set | .................................................................... | /* use a released css_set */ | [put_css_set is the same. But in the current code, all put_css_set are put into cgroup mutex critical region as the same as find_css_set.] [akpm@linux-foundation.org: repair comments] [menage@google.com: eliminate race in css_set refcounting] Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:38 -07:00
WANG Cong	c3b9f5afc7	kernel/configs.c: remove useless comments These comments are useless, remove them. Signed-off-by: WANG Cong <wangcong@zeuux.org> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:34 -07:00
Matt Helsley	1aece34833	container freezer: rename check_if_frozen() check_if_frozen() sounds like it should return something when in fact it's just updating the freezer state. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:34 -07:00
Matt Helsley	81dcf33c2a	container freezer: make freezer state names less generic Rename cgroup freezer states to be less generic to avoid any name collisions while also better describing what each state is. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:34 -07:00
Matt Helsley	957a4eeaf4	container freezer: prevent frozen tasks or cgroups from changing Don't let frozen tasks or cgroups change. This means frozen tasks can't leave their current cgroup for another cgroup. It also means that tasks cannot be added to or removed from a cgroup in the FROZEN state. We enforce these rules by checking for frozen tasks and cgroups in the can_attach() function. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:34 -07:00
Matt Helsley	5a06915c6d	container freezer: skip frozen cgroups during power management resume When a system is resumed after a suspend, it will also unfreeze frozen cgroups. This patch modifies the resume sequence to skip the tasks which are part of a frozen control group. Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:34 -07:00
Matt Helsley	dc52ddc0e6	container freezer: implement freezer cgroup subsystem This patch implements a new freezer subsystem in the control groups framework. It provides a way to stop and resume execution of all tasks in a cgroup by writing in the cgroup filesystem. The freezer subsystem in the container filesystem defines a file named freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in the cgroup. Reading will return the current state. * Examples of usage : # mkdir /containers/freezer # mount -t cgroup -ofreezer freezer /containers # mkdir /containers/0 # echo $some_pid > /containers/0/tasks to get status of the freezer subsystem : # cat /containers/0/freezer.state RUNNING to freeze all tasks in the container : # echo FROZEN > /containers/0/freezer.state # cat /containers/0/freezer.state FREEZING # cat /containers/0/freezer.state FROZEN to unfreeze all tasks in the container : # echo RUNNING > /containers/0/freezer.state # cat /containers/0/freezer.state RUNNING This is the basic mechanism which should do the right thing for user space task in a simple scenario. It's important to note that freezing can be incomplete. In that case we return EBUSY. This means that some tasks in the cgroup are busy doing something that prevents us from completely freezing the cgroup at this time. After EBUSY, the cgroup will remain partially frozen -- reflected by freezer.state reporting "FREEZING" when read. The state will remain "FREEZING" until one of these things happens: 1) Userspace cancels the freezing operation by writing "RUNNING" to the freezer.state file 2) Userspace retries the freezing operation by writing "FROZEN" to the freezer.state file (writing "FREEZING" is not legal and returns EIO) 3) The tasks that blocked the cgroup from entering the "FROZEN" state disappear from the cgroup's set of tasks. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: export thaw_process] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:34 -07:00
Matt Helsley	8174f1503f	container freezer: make refrigerator always available Now that the TIF_FREEZE flag is available in all architectures, extract the refrigerator() and freeze_task() from kernel/power/process.c and make it available to all. The refrigerator() can now be used in a control group subsystem implementing a control group freezer. Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Matt Helsley <matthltc@us.ibm.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:33 -07:00
Lee Schermerhorn	af936a1606	vmscan: unevictable LRU scan sysctl This patch adds a function to scan individual or all zones' unevictable lists and move any pages that have become evictable onto the respective zone's inactive list, where shrink_inactive_list() will deal with them. Adds sysctl to scan all nodes, and per node attributes to individual nodes' zones. Kosaki: If evictable page found in unevictable lru when write /proc/sys/vm/scan_unevictable_pages, print filename and file offset of these pages. [akpm@linux-foundation.org: fix one CONFIG_MMU=n build error] [kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:31 -07:00
Ingo Molnar	0c4b83da58	sched: disable the hrtick for now David Miller reported that hrtick update overhead has tripled the wakeup overhead on Sparc64. That is too much - disable the HRTICK feature for now by default, until a faster implementation is found. Reported-by: David Miller <davem@davemloft.net> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 14:27:43 +02:00
Peter Zijlstra	f9c0b0950d	sched: revert back to per-rq vruntime Vatsa rightly points out that having the runqueue weight in the vruntime calculations can cause unfairness in the face of task joins/leaves. Suppose: dv = dt * rw / w Then take 10 tasks t_n, each of similar weight. If the first will run 1 then its vruntime will increase by 10. Now, if the next 8 tasks leave after having run their 1, then the last task will get a vruntime increase of 2 after having run 1. Which will leave us with 2 tasks of equal weight and equal runtime, of which one will not be scheduled for 8/2=4 units of time. Ergo, we cannot do that and must use: dv = dt / w. This means we cannot have a global vruntime based on effective priority, but must instead go back to the vruntime per rq model we started out with. This patch was lightly tested by starting while loops on each nice level and observing their execution time, and a simple group scenario of 1:2:3 pinned to a single cpu. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 14:05:04 +02:00
Peter Zijlstra	a4c2f00f5c	sched: fair scheduler should not resched rt tasks With use of ftrace Steven noticed that some RT tasks got rescheduled due to sched_fair interaction. What happens is that we reprogram the hrtick from enqueue/dequeue_fair_task() because that can change nr_running, and thus the current task's ideal runtime. However, it's possible the current task isn't a fair_sched_class task, and thus doesn't have a hrtick set to change. Fix this by wrapping those hrtick_start_fair() calls in a hrtick_update() function, which will check for the right conditions. Reported-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 14:05:03 +02:00
Peter Zijlstra	ffda12a17a	sched: optimize group load balancer I noticed that tg_shares_up() unconditionally takes rq-locks for all cpus in the sched_domain. This hurts. We need the rq-locks whenever we change the weight of the per-cpu group sched entities. To alleviate this a little, only change the weight when the new weight is at least shares_thresh away from the old value. This avoids the rq-lock for the top level entries, since those will never be re-weighted, and fuzzes the lower level entries a little to gain performance in semi-stable situations. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 14:05:02 +02:00
Thomas Gleixner	643bdf68f9	hrtimers: simplify hrtimer_peek_ahead_timers() Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-20 13:38:11 +02:00
Thomas Gleixner	e1dd7bc585	hrtimers: fix docbook comments hrtimer_start() and hrtimer_start_range_ns() handle relative and absolute timers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-20 13:33:36 +02:00
Thomas Gleixner	c465a76af6	Merge branches 'timers/clocksource', 'timers/hrtimers', 'timers/nohz', 'timers/ntp', 'timers/posixtimers' and 'timers/debug' into v28-timers-for-linus	2008-10-20 13:14:06 +02:00
Thomas Gleixner	870e2a2845	timer_list: add base address to clock base The base address of a (per cpu) clock base is a useful debug info. Add it and bump the version number of timer_lists. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-20 11:51:30 +02:00
Thomas Gleixner	c5b77a3d3a	timer_list: print cpu number of clockevents device The per cpu clock events device output of timer_list lacks an association of the device to the cpu which is annoying when looking at the output of /proc/timer_list from a 128 way system. Add the CPU number info and mark the broadcast device in the device list printout. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-20 11:51:30 +02:00
Thomas Gleixner	e67ef25a35	timer_list: print real timer address The current timer_list output prints the address of the on stack copy of the active hrtimer instead of the hrtimer itself. Print the address of the real timer instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-20 11:51:30 +02:00
Ingo Molnar	3e10e879a8	Merge branch 'linus' into tracing-v28-for-linus-v3 Conflicts: init/main.c kernel/module.c scripts/bootgraph.pl	2008-10-19 19:04:47 +02:00
Linus Torvalds	26e9a39777	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (25 commits) staging: at76_usb wireless driver Staging: workaround build system bug Staging: Lindent sxg.c Staging: SLICOSS: Call pci_release_regions at driver exit Staging: SLICOSS: Fix remaining type names Staging: SLICOSS: Fix warnings due to static usage Staging: SLICOSS: lots of checkpatch fixes Staging: go7007 v4l fixes Staging: Fix gcc warnings in sxg Staging: add echo cancelation module Staging: add wlan-ng prism2 usb driver Staging: add w35und wifi driver Staging: USB/IP: add host driver Staging: USB/IP: add client driver Staging: USB/IP: add common functions needed Staging: add the go7007 video driver Staging: add me4000 pci data collection driver Staging: add me4000 firmware files Staging: add sxg network driver Staging: add Alacritech slicoss network driver ... Fixed up conflicts due to taint flags changes and MAINTAINERS cleanup in MAINTAINERS, include/linux/kernel.h and kernel/panic.c.	2008-10-17 09:50:12 -07:00
Arjan van de Ven	651dab4264	Merge commit 'linus/master' into merge-linus Conflicts: arch/x86/kvm/i8254.c	2008-10-17 09:20:26 -07:00
Thomas Gleixner	fb02fbc14d	NOHZ: restart tick device from irq_enter() We did not restart the tick device from irq_enter() to avoid double reprogramming and extra events in the return immediate to idle case. But long lasting softirqs can lead to a situation where jiffies become stale: idle() tick stopped (reprogrammed to next pending timer) halt() interrupt jiffies updated from irq_enter() interrupt handler softirq function 1 runs 20ms softirq function 2 arms a 10ms timer with a stale jiffies value jiffies updated from irq_exit() timer wheel has now an already expired timer (the one added in function 2) timer fires and timer softirq runs This was discovered when debugging a timer problem which happened only when the ath5k driver is active. The debugging proved that there is a softirq function running for more than 20ms, which is a bug by itself. To solve this we restart the tick timer right from irq_enter(), but do not go through the other functions which are necessary to return from idle when need_resched() is set. Reported-by: Elias Oltmanns <eo@nebensachen.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Elias Oltmanns <eo@nebensachen.de>	2008-10-17 18:13:38 +02:00
Thomas Gleixner	c34bec5a44	NOHZ: split tick_nohz_restart_sched_tick() Split out the clock event device reprogramming. Preparatory patch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-17 18:13:38 +02:00
Thomas Gleixner	719254faa1	NOHZ: unify the nohz function calls in irq_enter() We have two separate nohz function calls in irq_enter() for no good reason. Just call a single NOHZ function from irq_enter() and call the bits in the tick code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-17 18:13:38 +02:00
Mike Galbraith	b0aa51b999	sched: minor fast-path overhead reduction Greetings, 103638d added a bit of avoidable overhead to the fast-path. Use sysctl_sched_min_granularity instead of sched_slice() to restrict buddy wakeups. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-17 15:36:58 +02:00
Peter Zijlstra	b968905292	sched: fix the wrong mask_len, cleanup Clean up the division in show_schedstat(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-17 13:05:22 +02:00
Miao Xie	c851c8676b	sched: fix the wrong mask_len If NR_CPUS isn't a multiple of 32, we get a truncated string of sched domains by catting /proc/schedstat. This is caused by the wrong mask_len. This patch fixes it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-17 12:26:33 +02:00
Ingo Molnar	0f1f6dec95	Merge branch 'linus' into sched/urgent	2008-10-17 12:25:43 +02:00
David S. Miller	54514a70ad	softirq: Add support for triggering softirq work on softirqs. This is basically a genericization of Jens Axboe's block layer remote softirq changes. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-17 08:46:56 +02:00
Linus Torvalds	8cde1ad668	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched_clock: prevent scd->clock from moving backwards	2008-10-16 15:38:48 -07:00
Linus Torvalds	1c95e1b690	Fix kernel/softirq.c printk format warning properly This fixes the broken `77af7e3403` ("softirq, warning fix: correct a format to avoid a warning") fix correctly. The type of a pointer subtraction is not "int", nor is it "long". It can be either (or something else). It's "ptrdiff_t", and the printk format for it is "%td". Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 15:32:46 -07:00
Linus Torvalds	e533b22705	Merge branch 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails softirq, warning fix: correct a format to avoid a warning softirqs, debug: preemption check x86, pci-hotplug, calgary / rio: fix EBDA ioremap() IO resources, x86: ioremap sanity check to catch mapping requests exceeding, fix IO resources, x86: ioremap sanity check to catch mapping requests exceeding the BAR sizes softlockup: Documentation/sysctl/kernel.txt: fix softlockup_thresh description dmi scan: warn about too early calls to dmi_check_system() generic: redefine resource_size_t as phys_addr_t generic: make PFN_PHYS explicitly return phys_addr_t generic: add phys_addr_t for holding physical addresses softirq: allocate less vectors IO resources: fix/remove printk printk: robustify printk, update comment printk: robustify printk, fix #2 printk: robustify printk, fix printk: robustify printk Fixed up conflicts in: arch/powerpc/include/asm/types.h arch/powerpc/platforms/Kconfig.cputype manually.	2008-10-16 15:17:40 -07:00
Linus Torvalds	c813b4e16e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (46 commits) UIO: Fix mapping of logical and virtual memory UIO: add automata sercos3 pci card support UIO: Change driver name of uio_pdrv UIO: Add alignment warnings for uio-mem Driver core: add bus_sort_breadthfirst() function NET: convert the phy_device file to use bus_find_device_by_name kobject: Cleanup kobject_rename and !CONFIG_SYSFS kobject: Fix kobject_rename and !CONFIG_SYSFS sysfs: Make dir and name args to sysfs_notify() const platform: add new device registration helper sysfs: use ilookup5() instead of ilookup5_nowait() PNP: create device attributes via default device attributes Driver core: make bus_find_device_by_name() more robust usb: turn dev_warn+WARN_ON combos into dev_WARN debug: use dev_WARN() rather than WARN_ON() in device_pm_add() debug: Introduce a dev_WARN() function sysfs: fix deadlock device model: Do a quickcheck for driver binding before doing an expensive check Driver core: Fix cleanup in device_create_vargs(). Driver core: Clarify device cleanup. ...	2008-10-16 12:40:26 -07:00
Linus Torvalds	c8d8a2321f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: module: remove CONFIG_KMOD in comment after #endif remove CONFIG_KMOD from fs remove CONFIG_KMOD from drivers Manually fix conflict due to include cleanups in drivers/md/md.c	2008-10-16 12:38:34 -07:00
Adrian Bunk	2b252c5411	make kprobes.c:kretprobe_table_lock() static Make the needlessly global kretprobe_table_lock() static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:52 -07:00
Bjorn Helgaas	c26ec88ea8	resources: tidy __request_region() No functional change. Just return NULL for kzalloc failure immediately, rather than wrapping the whole function body in the body of an "if". Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:51 -07:00
Thomas Petazzoni	ebf3f09c63	Configure out AIO support This patch adds the CONFIG_AIO option which allows to remove support for asynchronous I/O operations, that are not necessarily used by applications, particularly on embedded devices. As this is a size-reduction option, it depends on CONFIG_EMBEDDED. It allows to save ~7 kilobytes of kernel code/data: text data bss dec hex filename 1115067 119180 217088 1451335 162547 vmlinux 1108025 119048 217088 1444161 160941 vmlinux.new -7042 -132 0 -7174 -1C06 +/- This patch has been originally written by Matt Mackall <mpm@selenic.com>, and is part of the Linux Tiny project. [randy.dunlap@oracle.com: build fix] Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Zach Brown <zach.brown@oracle.com> Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:51 -07:00
Alexey Dobriyan	f221e726bf	sysctl: simplify ->strategy name and nlen parameters passed to ->strategy hook are unused, remove them. In general ->strategy hook should know what it's doing, and don't do something tricky for which, say, pointer to original userspace array may be needed (name). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> [ networking bits ] Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Howells <dhowells@redhat.com> Cc: Matt Mackall <mpm@selenic.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:47 -07:00
Christoph Hellwig	b418da16dd	compat: generic compat get/settimeofday Nothing arch specific in get/settimeofday. The details of the timeval conversion varied a little from arch to arch, but all with the same results. Also add an extern declaration for sys_tz to linux/time.h because externs in .c files are frowned upon. I'll kill the externs in various other files in a separate patch. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David S. Miller <davem@davemloft.net> [ sparc bits ] Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:33 -07:00
Andi Kleen	20036fdcaf	Add kerneldoc documentation for new printk format extensions Add documentation in kerneldoc for new printk format extensions This patch documents the new %pS/%pF options in printk in kernel doc. Hope I didn't miss any other extension. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:32 -07:00
Randy Dunlap	6d5cd6effe	taint: fix kernel-doc Move print_tainted() kernel-doc to avoid the following error: Error(/var/linsrc/mmotm-2008-1002-1617//kernel/panic.c:155): cannot understand prototype: 'struct tnt ' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:32 -07:00
Francois Cami	e1f8e87449	Remove Andrew Morton's old email accounts People can use the real name as an index into MAINTAINERS to find the current email address. Signed-off-by: Francois Cami <francois.cami@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:32 -07:00
WANG Cong	7968b3d9a6	kernel/kallsyms.c: fix double return Commit `6dd06c9fbe` ("module: make module_address_lookup safe") introduced double returns in the function kallsyms_lookup(), it's weird. The second one should be removed. Signed-off-by: WANG Cong <wangcong@zeuux.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:32 -07:00
Andrew Morton	9679e4dd62	kernel/sys.c: improve code generation utsname() is quite expensive to calculate. Cache it in a local. text data bss dec hex filename before: 11136 720 16 11872 2e60 kernel/sys.o after: 11096 720 16 11832 2e38 kernel/sys.o Acked-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: "Serge E. Hallyn" <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Vegard Nossum	8798881507	utsname: completely overwrite prior information On sethostname() and setdomainname(), previous information may be retained if it was longer than the new hostname/domainname. This can be demonstrated trivially by calling sethostname() first with a long name, then with a short name, and then calling uname() to retrieve the full buffer that contains the hostname (and possibly parts of the old hostname), one just has to look past the terminating zero. I don't know if we should really care that much (hence the RFC); the only scenarios I can possibly think of is administrator putting something sensitive in the hostname (or domain name) by accident, and changing it back will not undo the mistake entirely, though it's not like we can recover gracefully from "rm -rf /" either... The other scenario is namespaces (CLONE_NEWUTS) where some information may be unintentionally "inherited" from the previous namespace (a program wants to hide the original name and does clone + sethostname, but some information is still left). I think the patch may be defended on grounds of the principle of least surprise. But I am not adamant :-) (I guess the question now is whether userspace should be able to write embedded NULs into the buffer or not...) At least the observation has been made and the patch has been presented. Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Serge E. Hallyn" <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Dave Hansen	22b8ce9470	profiling: dynamically enable readprofile at runtime Way too often, I have a machine that exhibits some kind of crappy behavior. The CPU looks wedged in the kernel or it is spending way too much system time and I wonder what is responsible. I try to run readprofile. But, of course, Ubuntu doesn't enable it by default. Dang! The reason we boot-time enable it is that it takes a big buffer that we generally can only bootmem alloc. But, does it hurt to at least try and runtime-alloc it? To use: echo 2 > /sys/kernel/profile Then run readprofile like normal. This should fix the compile issue with allmodconfig. I've compile-tested on a bunch more configs now including a few more architectures. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Adam Tkac	0c2d64fb6c	rlimit: permit setting RLIMIT_NOFILE to RLIM_INFINITY When a process wants to set the limit of open files to RLIM_INFINITY it gets EPERM even if it has CAP_SYS_RESOURCE capability. For example, BIND does: ... #elif defined(NR_OPEN) && defined(__linux__) /* * Some Linux kernels don't accept RLIM_INFINIT; the maximum * possible value is the NR_OPEN defined in linux/fs.h. */ if (resource == isc_resource_openfiles && rlim_value == RLIM_INFINITY) { rl.rlim_cur = rl.rlim_max = NR_OPEN; unixresult = setrlimit(unixresource, &rl); if (unixresult == 0) return (ISC_R_SUCCESS); } #elif ... If we allow setting RLIMIT_NOFILE to RLIM_INFINITY we increase portability - you don't have to check if OS is linux and then use different schema for limits. The spec says "Specifying RLIM_INFINITY as any resource limit value on a successful call to setrlimit() shall inhibit enforcement of that resource limit." and we're presently not doing that. Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Andi Kleen	25ddbb18aa	Make the taint flags reliable It's somewhat unlikely that it happens, but right now a race window between interrupts or machine checks or oopses could corrupt the tainted bitmap because it is modified in a non atomic fashion. Convert the taint variable to an unsigned long and use only atomic bit operations on it. Unfortunately this means the intvec sysctl functions cannot be used on it anymore. It turned out the taint sysctl handler could actually be simplified a bit (since it only increases capabilities) so this patch actually removes code. [akpm@linux-foundation.org: remove unneeded include] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Jan Beulich	9ba16087d9	Kconfig: eliminate "def_bool n" constructs Using "def_bool n" is pointless, simply using bool here appears more appropriate. Further, retaining such options that don't have a prompt and aren't selected by anything seems also at least questionable. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tony Luck <tony.luck@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Tejun Heo	a25d644fc0	wait: kill is_sync_wait() is_sync_wait() is used to distinguish between sync and async waits. Basically sync waits are the ones initialized with init_waitqueue_entry() and async ones with init_waitqueue_func_entry(). The sync/async distinction is used only in prepare_to_wait[_exclusive]() and its only function is to skip setting the current task state if the wait is async. This has a few problems. * No one uses it. None of func_entry users use prepare_to_wait() functions, so the code path never gets executed. * The distinction is bogus. Maybe back when func_entry is used only by aio but it's now also used by epoll and in future possibly by 9p and poll/select. * Taking @state as argument and ignoring it silently depending on how @wait is initialized is just a bad error-prone API. * It prevents func_entry waits from using wait->private for no good reason. This patch kills is_sync_wait() and the associated code paths from prepare_to_wait[_exclusive](). As there was no user of these code paths, this patch doesn't cause any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Adrian Bunk	d9f3216b47	kernel/dma.c: remove a CVS keyword Remove a CVS keyword that wasn't updated for a long time from a comment. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:30 -07:00
Rafael J. Wysocki	1bfcf1304e	pm: rework disabling of user mode helpers during suspend/hibernation We currently use a PM notifier to disable user mode helpers before suspend and hibernation and to re-enable them during resume. However, this is not an ideal solution, because if any drivers want to upload firmware into memory before suspend, they have to use a PM notifier for this purpose and there is no guarantee that the ordering of PM notifiers will be as expected (ie. the notifier that disables user mode helpers has to be run after the driver's notifier used for uploading the firmware). For this reason, it seems better to move the disabling and enabling of user mode helpers to separate functions that will be called by the PM core as necessary. [akpm@linux-foundation.org: remove unneeded ifdefs] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Alan Stern <stern@rowland.harvard.edu> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:29 -07:00
Balbir Singh	9363b9f23c	memrlimit: cgroup mm owner callback changes to add task info This patch adds an additional field to the mm_owner callbacks. This field is required to get to the mm that changed. Hold mmap_sem in write mode before calling the mm_owner_changed callback [hugh@veritas.com: fix mmap_sem deadlock] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:28 -07:00
Jason Baron	346e15beb5	driver core: basic infrastructure for per-module dynamic debug messages Base infrastructure to enable per-module debug messages. I've introduced CONFIG_DYNAMIC_PRINTK_DEBUG, which when enabled centralizes control of debugging statements on a per-module basis in one /proc file, currently, <debugfs>/dynamic_printk/modules. When CONFIG_DYNAMIC_PRINTK_DEBUG is not set, debugging statements can still be enabled as before, often by defining 'DEBUG' for the proper compilation unit. Thus, this patch set has no effect when CONFIG_DYNAMIC_PRINTK_DEBUG is not set. The infrastructure currently ties into all pr_debug() and dev_dbg() calls. That is, if CONFIG_DYNAMIC_PRINTK_DEBUG is set, all pr_debug() and dev_dbg() calls can be dynamically enabled/disabled on a per-module basis. Future plans include extending this functionality to subsystems, that define their own debug levels and flags. Usage: Dynamic debugging is controlled by the debugfs file, <debugfs>/dynamic_printk/modules. This file contains a list of the modules that can be enabled. The format of the file is as follows: <module_name> <enabled=0/1> . . . <module_name> : Name of the module in which the debug call resides <enabled=0/1> : whether the messages are enabled or not For example: snd_hda_intel enabled=0 fixup enabled=1 driver enabled=0 Enable a module: $echo "set enabled=1 <module_name>" > dynamic_printk/modules Disable a module: $echo "set enabled=0 <module_name>" > dynamic_printk/modules Enable all modules: $echo "set enabled=1 all" > dynamic_printk/modules Disable all modules: $echo "set enabled=0 all" > dynamic_printk/modules Finally, passing "dynamic_printk" at the command line enables debugging for all modules. This mode can be turned off via the above disable command. [gkh: minor cleanups and tweaks to make the build work quietly] Signed-off-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-16 09:24:47 -07:00
Alexey Dobriyan	e94320939f	modules: fix module "notes" kobject leak Fix "notes" kobject leak It happens every rmmod if KALLSYMS=y and SYSFS=y. # modprobe foo kobject: 'foo' (ffffffffa00743d0): kobject_add_internal: parent: 'module', set: 'module' kobject: 'holders' (ffff88017e7c5770): kobject_add_internal: parent: 'foo', set: '<NULL>' kobject: 'foo' (ffffffffa00743d0): kobject_uevent_env kobject: 'foo' (ffffffffa00743d0): fill_kobj_path: path = '/module/foo' kobject: 'notes' (ffff88017fa9b668): kobject_add_internal: parent: 'foo', set: '<NULL>' ^^^^^ # rmmod foo kobject: 'holders' (ffff88017e7c5770): kobject_cleanup kobject: 'holders' (ffff88017e7c5770): auto cleanup kobject_del kobject: 'holders' (ffff88017e7c5770): calling ktype release kobject: (ffff88017e7c5770): dynamic_kobj_release kobject: 'holders': free name kobject: 'foo' (ffffffffa00743d0): kobject_cleanup kobject: 'foo' (ffffffffa00743d0): does not have a release() function, it is broken and must be fixed. kobject: 'foo' (ffffffffa00743d0): auto cleanup 'remove' event kobject: 'foo' (ffffffffa00743d0): kobject_uevent_env kobject: 'foo' (ffffffffa00743d0): fill_kobj_path: path = '/module/foo' kobject: 'foo' (ffffffffa00743d0): auto cleanup kobject_del kobject: 'foo': free name [whooops] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-16 09:24:41 -07:00
Rusty Russell	118a9069f0	module: remove CONFIG_KMOD in comment after #endif Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-17 02:38:37 +11:00
Thomas Gleixner	63d659d556	genirq: fix name space collision of nr_irqs in autoprobe.c probe_irq_off() is dysfunctional as the local nr_irqs is referenced instead of the global one for the for_each_irq_desc() iterator. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:30 +02:00
Thomas Gleixner	10e580842e	genirq: use iterators for irq_desc loops Use for_each_irq_desc[_reverse] for all the iteration loops. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:30 +02:00
Thomas Gleixner	d3c60047bd	genirq: cleanup the sparseirq modifications Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:29 +02:00
Thomas Gleixner	d6c88a507e	genirq: revert dynarray Revert the dynarray changes. They need more thought and polishing. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:15 +02:00
Thomas Gleixner	ee32c97322	genirq: remove irq_to_desc_alloc Remove the leftover of sparseirqs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:15 +02:00
Thomas Gleixner	2cc21ef843	genirq: remove sparse irq code This code is not ready, but we need to rip it out instead of rebasing as we would lose the APIC/IO_APIC unification otherwise. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:15 +02:00
Thomas Gleixner	c6b7674f32	genirq: use inline function for irq_to_desc For the non sparse irq case an inline function is perfectly fine. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:14 +02:00
Yinghai Lu	aac3f2b6f6	x86: fix typo in irq_desc array when SPARSE_IRQ is not used, should still use irq_desc->lock Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:11 +02:00
Andrew Morton	2976fe2012	fix warning: "x86: sparse_irq needs spin_lock in allocations" caused by commit a532e19680ada3b8579b81e67e76d3ebd19c340f Author: Yinghai Lu <yhlu.kernel@gmail.com> Date: Wed Aug 20 20:46:25 2008 -0700 x86: sparse_irq needs spin_lock in allocations Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:11 +02:00
Yinghai Lu	9d98598d2f	sparseirq: remove some debug print out Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:11 +02:00
Yinghai Lu	e00585bb7f	irq: fix irqpoll && sparseirq Steven Noonan reported a boot hang when using irqpoll and CONFIG_HAVE_SPARSE_IRQ=y. The irqpoll loop needs to be updated to not iterate from 1 to nr_irqs but to iterate via for_each_irq_desc(). (in the former case desc can be NULL which crashes the box) Reported-by: Steven Noonan <steven@uplinklabs.net> Tested-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:10 +02:00
venkatesh.pallipadi@intel.com	932775a4ab	x86: HPET_MSI change IRQ affinity in process context when it is disabled Change the IRQ affinity in the process context when the IRQ is disabled. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:07 +02:00
Dean Nelson	21056830c4	irq: set_irq_chip() has redundant call to irq_to_desc() Extraneous call to irq_to_desc(). Signed-off-by: Dean Nelson <dcn@sgi.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:07 +02:00
Yinghai Lu	8c464a4b23	sparseirq: move kstat_irqs from kstat to irq_desc - fix fix non-sparseirq architectures. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:04 +02:00
Yinghai Lu	e89eb43863	x86: sparse_irq needs spin_lock in allocations Suresh Siddha noticed that we should have a spinlock around it. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:59 +02:00
Ingo Molnar	e955b5398b	sparseirq: fix lockdep -tip testing found this lockdep splat: [ 0.000000] Initializing CPU#0 [ 0.000000] found new irq_desc for irq 0 [ 0.000000] INFO: trying to register non-static key. [ 0.000000] the code is fine but needs lockdep annotation. [ 0.000000] turning off the locking correctness validator. [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.27-rc3-tip-00191-g98ccb89-dirty #1 [ 0.000000] [<c0153c22>] register_lock_class+0x3d2/0x400 [ 0.000000] [<c0104d87>] ? mcount_call+0x5/0xa [ 0.000000] [<c0154f3a>] __lock_acquire+0x22a/0x5d0 [ 0.000000] [<c0104d87>] ? mcount_call+0x5/0xa [ 0.000000] [<c0155351>] lock_acquire+0x71/0xa0 [ 0.000000] [<c016d61f>] ? set_irq_chip+0x3f/0x90 [ 0.000000] [<c070f148>] _spin_lock_irqsave+0x58/0x90 [ 0.000000] [<c016d61f>] ? set_irq_chip+0x3f/0x90 [ 0.000000] [<c016d61f>] set_irq_chip+0x3f/0x90 [ 0.000000] [<c016d7e0>] ? handle_level_irq+0x0/0xe0 [ 0.000000] [<c016da1a>] set_irq_chip_and_handler_name+0x1a/0x40 [ 0.000000] [<c0a396c1>] init_ISA_irqs+0x51/0xa0 [ 0.000000] [<c0a4a365>] pre_intr_init_hook+0x25/0x30 [ 0.000000] [<c0a39723>] native_init_IRQ+0x13/0x370 [ 0.000000] [<c015569c>] ? lock_release+0xcc/0x1d0 [ 0.000000] [<c0104d87>] ? mcount_call+0x5/0xa [ 0.000000] [<c070dc22>] ? __mutex_unlock_slowpath+0x92/0x110 [ 0.000000] [<c070dcad>] ? mutex_unlock+0xd/0x10 [ 0.000000] [<c0135f62>] ? cpu_maps_update_done+0x12/0x20 [ 0.000000] [<c06c6743>] ? register_cpu_notifier+0x23/0x30 [ 0.000000] [<c011e8ae>] init_IRQ+0xe/0x10 [ 0.000000] [<c0a357a5>] start_kernel+0x1c5/0x340 [ 0.000000] [<c0a35280>] ? unknown_bootoption+0x0/0x210 [ 0.000000] [<c0a3506b>] i386_start_kernel+0x6b/0x80 [ 0.000000] ======================= [ 0.000000] found new irq_desc for irq 1 [ 0.000000] found new irq_desc for irq 2 [ 0.000000] found new irq_desc for irq 3 this: static void init_one_irq_desc(struct irq_desc *desc) { memcpy(desc, &irq_desc_init, sizeof(struct irq_desc)); #ifdef CONFIG_TRACE_IRQFLAGS lockdep_set_class(&desc->lock, &irq_desc_lock_class); #endif } should be unconditional. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:54 +02:00
Yinghai Lu	8b8e8c1bf7	x86: remove irqbalance in kernel for 32 bit This has been deprecated for years, the user space irqbalanced utility works better with numa, has configurable policies, etc... Signed-off-by: Yinghai Lu <yhlu.kernel@gmai.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:52 +02:00
Yinghai Lu	67fb283e14	irq: separate sparse_irqs from sparse_irqs_free so that later code doesn't need to compare with -1U Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:51 +02:00
Yinghai Lu	cb5bc83225	x86_64: rename irq_desc/irq_desc_alloc change names: irq_desc() ==> irq_desc_alloc __irq_desc() ==> irq_desc Also split a few of the uses in lowlevel x86 code. v2: need to check if desc is null in smp_irq_move_cleanup Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:51 +02:00
Yinghai Lu	46926b67fc	generic: add irq_desc in function in parameter So we could remove some duplicated calling to irq_desc v2: make sure irq_desc in init/main.c is not used without generic_hardirqs Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:50 +02:00
Yinghai Lu	7d94f7ca40	irq: remove >= nr_irqs checking with config_have_sparse_irq remove irq limit checks - nr_irqs is dynamic and we expand anytime. v2: fix checking about result irq_cfg_without_new, so could use msi again v3: use irq_desc_without_new to check irq is valid Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:50 +02:00
Yinghai Lu	2c6927a38f	irq: replace loop with nr_irqs with for_each_irq_desc There are a handful of loops that go from 0 to nr_irqs and use get_irq_desc() on them. These would allocate all the irq_desc entries, regardless of the need for them. Use the smarter for_each_irq_desc() iterator that will only iterate over the present ones. v2: make sure arch without GENERIC_HARDIRQS work too Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:33 +02:00
Yinghai Lu	9059d8fa4a	irq: add irq_desc_without_new add an irq_desc accessor that will not allocate any sparse entry but returns failure if there's no entry present. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:32 +02:00
Yinghai Lu	7f95ec9e4c	x86: move kstat_irqs from kstat to irq_desc based on Eric's patch ... together mold it with dyn_array for irq_desc, will allocate kstat_irqs for nr_irq_desc altogether if needed. -- at that point nr_cpus is known already. v2: make sure system without generic_hardirqs works they don't have irq_desc v3: fix merging v4: [mingo@elte.hu] fix typo [ mingo@elte.hu ] irq: build fix fix: arch/x86/xen/spinlock.c: In function 'xen_spin_lock_slow': arch/x86/xen/spinlock.c:90: error: 'struct kernel_stat' has no member named 'irqs' Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:32 +02:00
Ingo Molnar	3bf52a4df3	irq: sparse irqs, fix IRQ auto-probe crash fix: [ 10.631533] calling yenta_socket_init+0x0/0x20 [ 10.631533] Yenta: CardBus bridge found at 0000:15:00.0 [17aa:2012] [ 10.631533] Yenta: Using INTVAL to route CSC interrupts to PCI [ 10.631533] Yenta: Routing CardBus interrupts to PCI [ 10.631533] Yenta TI: socket 0000:15:00.0, mfunc 0x01d01002, devctl 0x64 [ 10.731599] BUG: unable to handle kernel NULL pointer dereference at 00000040 [ 10.731838] IP: [<c0c95b5f>] _spin_lock_irq+0xf/0x20 [ 10.732221] *pde = 00000000 [ 10.732741] Oops: 0002 [#1] SMP [ 10.733453] [ 10.734253] Pid: 1, comm: swapper Tainted: G W (2.6.27-rc3-tip-00173-gd7eaa4f-dirty #1) [ 10.735188] EIP: 0060:[<c0c95b5f>] EFLAGS: 00010002 CPU: 0 [ 10.735523] EIP is at _spin_lock_irq+0xf/0x20 [ 10.735523] EAX: 00000040 EBX: 00000000 ECX: f6e04c90 EDX: 00000100 [ 10.735523] ESI: 000000df EDI: f6e04c90 EBP: f7867df0 ESP: f7867df0 [ 10.735523] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 [ 10.735523] Process swapper (pid: 1, ti=f7867000 task=f7870000 task.ti=f7867000) [ 10.735523] Stack: f7867e04 c0155fbd 00000000 00000000 f6e04c90 f7867e5c c0c6e319 c0f6a074 [ 10.735523] f6e04c90 000017aa 00002012 c112b648 f791f240 c112b5e0 f7867e44 c010440b [ 10.735523] f791f240 f791f29c c112b8ec f791f240 00000000 f7867e5c c048f893 03c0b648 [ 10.735523] Call Trace: [ 10.735523] [<c0155fbd>] ? probe_irq_on+0x3d/0x140 [ 10.735523] [<c0c6e319>] ? yenta_probe+0x529/0x640 [ 10.735523] [<c010440b>] ? mcount_call+0x5/0xa [ 10.735523] [<c048f893>] ? pci_match_device+0xa3/0xb0 [ 10.735523] [<c048fc1e>] ? pci_device_probe+0x5e/0x80 [ 10.735523] [<c0515423>] ? driver_probe_device+0x83/0x180 [ 10.735523] [<c0515594>] ? __driver_attach+0x74/0x80 [ 10.735523] [<c0514b69>] ? bus_for_each_dev+0x49/0x70 [ 10.735523] [<c051528e>] ? driver_attach+0x1e/0x20 [ 10.735523] [<c0515520>] ? __driver_attach+0x0/0x80 [ 10.735523] [<c05150d3>] ? bus_add_driver+0x1a3/0x220 [ 10.735523] [<c048fb60>] ? 
pci_device_remove+0x0/0x40 [ 10.735523] [<c05157f4>] ? driver_register+0x54/0x130 [ 10.735523] [<c048fe2f>] ? __pci_register_driver+0x4f/0x90 [ 10.735523] [<c11e9419>] ? yenta_socket_init+0x19/0x20 [ 10.735523] [<c0101125>] ? do_one_initcall+0x35/0x160 [ 10.735523] [<c11e9400>] ? yenta_socket_init+0x0/0x20 [ 10.735523] [<c01391a6>] ? __queue_work+0x36/0x50 [ 10.735523] [<c013922d>] ? queue_work_on+0x3d/0x50 [ 10.735523] [<c11a2758>] ? kernel_init+0x148/0x210 [ 10.735523] [<c11a2610>] ? kernel_init+0x0/0x210 [ 10.735523] [<c01043f3>] ? kernel_thread_helper+0x7/0x10 [ 10.735523] ======================= [ 10.735523] Code: 10 38 f2 74 06 f3 90 8a 10 eb f6 5d 89 c8 c3 8d b6 00 00 00 00 8d bc 27 00 00 00 00 55 89 e5 e8 a4 e8 46 ff fa ba 00 01 00 00 90 <66> 0f c1 10 38 f2 74 06 f3 90 8a 10 eb f6 5d c3 90 55 89 e5 53 as auto-probing wants to iterate over existing irqs. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:30 +02:00
Yinghai Lu	08678b0841	generic: sparse irqs: use irq_desc() together with dyn_array, instead of irq_desc[] add CONFIG_HAVE_SPARSE_IRQ to use a condensed array. Get rid of irq_desc[] array assumptions. Preallocate 32 irq_desc, and irq_desc() will try to get more. ( No change in functionality is expected anywhere, except the odd build failure where we missed a code site or where a crossing commit introduces new irq_desc[] usage. ) v2: according to Eric, change get_irq_desc() to irq_desc() Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:29 +02:00
Yinghai Lu	d17a55ded3	irq: make irqs in kernel stat use per_cpu_dyn_array Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:08 +02:00
Ingo Molnar	fa42d10dd5	irq: sparse irqs, export nr_irqs fix: Building modules, stage 2. MODPOST 458 modules ERROR: "nr_irqs" [drivers/serial/8250.ko] undefined! Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:08 +02:00
Yinghai Lu	d60458b224	irq: make irq_desc to use dyn_array Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:07 +02:00
Yinghai Lu	85c0f90978	irq: introduce nr_irqs at this point nr_irqs is equal NR_IRQS convert a few easy users from NR_IRQS to dynamic nr_irqs. v2: according to Eric, we need to take care of arch without generic_hardirqs Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:05 +02:00
Ingo Molnar	5fef06e8c8	Merge branch 'linus' into genirq	2008-10-16 16:51:32 +02:00
Peter Zijlstra	8cd162ce23	sched: only update rq->clock while holding rq->lock Vatsa noticed rq->clock going funny and tracked it down to an update_rq_clock() outside a rq->lock section. This is a problem because things like double_rq_lock() update the rq->clock value for both rqs. Therefore disabling interrupts isn't strong enough. Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-15 20:43:27 +02:00
Ingo Molnar	6b2ada8210	Merge branches 'core/softlockup', 'core/softirq', 'core/resources', 'core/printk' and 'core/misc' into core-v28-for-linus	2008-10-15 12:48:44 +02:00
Linus Torvalds	8acd3a60bc	Merge branch 'for-2.6.28' of git://linux-nfs.org/~bfields/linux * 'for-2.6.28' of git://linux-nfs.org/~bfields/linux: (59 commits) svcrdma: Fix IRD/ORD polarity svcrdma: Update svc_rdma_send_error to use DMA LKEY svcrdma: Modify the RPC reply path to use FRMR when available svcrdma: Modify the RPC recv path to use FRMR when available svcrdma: Add support to svc_rdma_send to handle chained WR svcrdma: Modify post recv path to use local dma key svcrdma: Add a service to register a Fast Reg MR with the device svcrdma: Query device for Fast Reg support during connection setup svcrdma: Add FRMR get/put services NLM: Remove unused argument from svc_addsock() function NLM: Remove "proto" argument from lockd_up() NLM: Always start both UDP and TCP listeners lockd: Remove unused fields in the nlm_reboot structure lockd: Add helper to sanity check incoming NOTIFY requests lockd: change nlmclnt_grant() to take a "struct sockaddr *" lockd: Adjust nlmsvc_lookup_host() to accomodate AF_INET6 addresses lockd: Adjust nlmclnt_lookup_host() signature to accomodate non-AF_INET lockd: Support non-AF_INET addresses in nlm_lookup_host() NLM: Convert nlm_lookup_host() to use a single argument svcrdma: Add Fast Reg MR Data Types ...	2008-10-14 12:31:14 -07:00
Ingo Molnar	98d9c66ab0	tracing/fastboot: improve help text Improve the help text of the boot tracer. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 14:27:20 +02:00
Ingo Molnar	4519d9e54d	tracing/stacktrace: improve help text Improve the help text that is displayed for CONFIG_STACK_TRACER. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 14:15:43 +02:00
Harvey Harrison	ad0a3b6811	trace: add build-time check to avoid overrunning hex buffer Remove the runtime BUG_ON and change to a compile-time check in the macro that calls the hex format routine [Noticed by Joe Perches] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:26 +02:00
Harvey Harrison	2fbc474901	ftrace: fix hex output mode of ftrace Fix the output of ftrace in hex mode as the hi/lo nibbles are output in reverse order. Without this patch, the output of ftrace is: raw mode : 6474 0 141531612444 0 140 + 6402 120 S hex mode : 000091a4 00000000 000000023f1f50c1 00000000 c8 000000b2 00009120 87 ffff00c8 00000035 There is an inversion in the output: hex(6474) is 194a [based on a patch by Philippe Reynes <tremyfr@yahoo.fr>] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:25 +02:00
Arjan van de Ven	8a5d900cca	tracing/fastboot: fix printk format typo in boot tracer When printing nanoseconds, the right printk format string is %09 not %06... Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Frédéric Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:23 +02:00
Frederic Weisbecker	c2931e05ec	ftrace: return an error when setting a nonexistent tracer When one tries to set a nonexistent tracer, no error is returned, as if the name of the tracer was correct. We should return -EINVAL. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:22 +02:00
Steven Rostedt	3ea2e6d71a	ftrace: make some tracers reentrant Now that the ring buffer is reentrant, some of the ftrace tracers (sched_switch, debugging traces) can also be reentrant. Note: Never make the function tracer reentrant, that can cause recursion problems all over the kernel. The function tracer must disable reentrancy. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:20 +02:00
Steven Rostedt	bf41a158ca	ring-buffer: make reentrant This patch replaces the local_irq_save/restore with preempt_disable/ enable. This allows for interrupts to enter while recording. To write to the ring buffer, you must reserve data, and then commit it. During this time, an interrupt may call a trace function that will also record into the buffer before the commit is made. The interrupt will reserve its entry after the first entry, even though the first entry did not finish yet. The time stamp delta of the interrupt entry will be zero, since in the view of the trace, the interrupt happened during the first field anyway. Locking still takes place when the tail/write moves from one page to the next. The reader always takes the locks. A new page pointer is added, called the commit. The write/tail will always point to the end of all entries. The commit field will point to the last committed entry. Only this commit entry may update the write time stamp. The reader can only go up to the commit. It cannot go past it. If a lot of interrupts come in during a commit that fills up the buffer, and it happens to make it all the way around the buffer back to the commit, then a warning is printed and new events will be dropped. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:19 +02:00
Steven Rostedt	6f807acd27	ring-buffer: move page indexes into page headers Remove the global head and tail indexes and move them into the page header. Each page will now keep track of where the last write and read was made. We also rename the head and tail to read and write for better clarification. This patch is needed for future enhancements to move the ring buffer to a lockless solution. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:18 +02:00
Frederic Weisbecker	097d036a2f	tracing/fastboot: only trace non-module initcalls At this time, only built-in initcalls interest us. We can't really produce a relevant graph if we include the modules initcall too. I had good results after this patch (see svg in attachment). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:17 +02:00
Steven Rostedt	6450c1d321	ftrace: move pc counter in irqtrace The assigning of the pc counter is in the wrong spot in the check_critical_timing function. The pc variable is used in the out jump. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:16 +02:00
Steven Rostedt	aa1e0e3bcf	ring_buffer: map to cpu not page My original patch had a compile bug when NUMA was configured. I referenced cpu when it should have been cpu_buffer->cpu. Ingo quickly fixed this bug by replacing cpu with 'i' because that was the loop counter. Unfortunately, the 'i' was the counter of pages, not CPUs. This caused a crash when the number of pages allocated for the buffers exceeded the number of CPUs, which would usually be the case. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:15 +02:00
Frederic Weisbecker	5601020feb	tracing/fastboot: get the initcall name before it disappears After some initcall traces, some initcall names may be inconsistent. That's because these functions will disappear from the .init section and also their name from the symbols table. So we have to copy the name of the function in a buffer large enough during the trace appending. It is not costly for the ring_buffer because the number of initcall entries is commonly not really large. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:12 +02:00
Frederic Weisbecker	cb5ab74204	tracing/fastboot: change the printing of boot tracer according to bootgraph.pl Change the boot tracer printing to make it parsable for the scripts/bootgraph.pl script. We have now to output two lines for each initcall, according to the printk in do_one_initcall() in init/main.c We need now the call's time and the return's time. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:11 +02:00
Ingo Molnar	77ae11f63b	ring-buffer: fix build error fix: kernel/trace/ring_buffer.c: In function ‘rb_allocate_pages’: kernel/trace/ring_buffer.c:235: error: ‘cpu’ undeclared (first use in this function) kernel/trace/ring_buffer.c:235: error: (Each undeclared identifier is reported only once kernel/trace/ring_buffer.c:235: error: for each function it appears in.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:10 +02:00
Steven Rostedt	38697053fa	ftrace: preempt disable over interrupt disable With the new ring buffer infrastructure in ftrace, I'm trying to make ftrace a little more light weight. This patch converts a lot of the local_irq_save/restore into preempt_disable/enable. The original preempt count in a lot of cases has to be sent in as a parameter so that it can be recorded correctly. Some places were recording it incorrectly before anyway. This is also laying the ground work to make ftrace a little bit more reentrant, and remove all locking. The function tracers must still protect from reentrancy. Note: All the function tracers must be careful when using preempt_disable. It must do the following: resched = need_resched(); preempt_disable_notrace(); [...] if (resched) preempt_enable_no_resched_notrace(); else preempt_enable_notrace(); The reason is that if this function traces schedule() itself, the preempt_enable_notrace() will cause a schedule, which will lead us into a recursive failure. If we needed to reschedule before calling preempt_disable, we should have already scheduled. Since we did not, this is most likely that we should not and are probably inside a schedule function. If resched was not set, we still need to catch the need resched flag being set when preemption was off and the if case at the end will catch that for us. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:09 +02:00
Steven Rostedt	e4c2ce82ca	ring_buffer: allocate buffer page pointer The current method of overlaying the page frame as the buffer page pointer can be very dangerous and limits our ability to do other things with a page from the buffer, like send it off to disk. This patch allocates the buffer_page instead of overlaying the page's page frame. The use of the buffer_page has hardly changed due to this. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:08 +02:00
Steven Rostedt	7104f300c5	ftrace: type cast filter+verifier The mmiotrace map had a bug that would typecast the entry from the trace to the wrong type. That is a known danger of C typecasts, there's absolutely zero checking done on them. Help that problem a bit by using a GCC extension to implement a type filter that restricts the types that a trace record can be cast into, and by adding a dynamic check (in debug mode) to verify the type of the entry. This patch adds a macro to assign all entries of ftrace using the type of the variable and checking the entry id. The typecasts are now done in the macro for only those types that it knows about, which should be all the types that are allowed to be read from the tracer. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:07 +02:00
Frederic Weisbecker	797d3712a9	tracing/ftrace: adapt mmiotrace to the new type of print_line, fix Correct the value's type of trace_empty function Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:06 +02:00
Steven Rostedt	d769041f86	ring_buffer: implement new locking The old "lock always" scheme had issues with lockdep, and was not very efficient anyways. This patch does a new design to be partially lockless on writes. Writes will add new entries to the per cpu pages by simply disabling interrupts. When a write needs to go to another page then it will grab the lock. A new "read page" has been added so that the reader can pull out a page from the ring buffer to read without worrying about the writer writing over it. This allows us to not take the lock for all reads. The lock is now only taken when a read needs to go to a new page. This is far from lockless, and interrupts still need to be disabled, but it is a step towards a more lockless solution, and it also solves a lot of the issues that were noticed by the first conversion of ftrace to the ring buffers. Note: the ring_buffer_{un}lock API has been removed. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:05 +02:00
Steven Rostedt	70255b5e3f	ring_buffer: remove raw from local_irq_save The raw_local_irq_save causes issues with lockdep. We don't need it so replace them with local_irq_save. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:04 +02:00
Frederic Weisbecker	9e9efffb78	tracing/ftrace: adapt the boot tracer to the new print_line type This patch adapts the boot tracer to the new type of the print_line callback. It still relays entries it doesn't support to default output functions. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:03 +02:00
Frederic Weisbecker	07f4e4f790	tracing/ftrace: adapt mmiotrace to the new type of print_line Adapt mmiotrace to the new print_line type. By default, it ignores (and consumes) types it doesn't support. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:02 +02:00
Pekka Paalanen	9ff4b9744c	tracing/ftrace: fix pipe breaking This patch fixes a bug which breaks the pipe when the seq is empty. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:01 +02:00
Frederic Weisbecker	2c4f035f6c	tracing/ftrace: change the type of the print_line callback We need a kind of disambiguation when a print_line callback returns 0. _There is not enough space to print all the entry. Please flush the seq and retry. _I can't handle this type of entry This patch changes the type of this callback for better information. Also some changes have been made in this V2. _ Only relay to default functions after the print_line callback fails. _ This patch doesn't fix the issue with the broken pipe (see patch 2/4 for that) Some things are still in discussion: _ Find better names for the enum print_line_t values _ Change the type of print_trace_line into boolean. Patches to change that can be sent later. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:00 +02:00
Steven Rostedt	777e208d40	ftrace: take advantage of variable length entries Now that the underlying ring buffer for ftrace holds variable length entries, we can take advantage of this by only storing the size of the actual event into the buffer. This happens to increase the number of entries in the buffer dramatically. We can also get rid of the "trace_cont" operation, but I'm keeping that until we have no more users. Some of the ftrace tracers can now change their code to adapt to this new feature. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:59 +02:00
Steven Rostedt	3928a8a2d9	ftrace: make work with new ring buffer This patch ports ftrace over to the new ring buffer. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:57 +02:00
Steven Rostedt	ed56829cb3	ring_buffer: reset buffer page when freeing Mathieu Desnoyers pointed out that the freeing of the page frame needs to be reset otherwise we might trigger BUG_ON in the page free code. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:56 +02:00
Steven Rostedt	a7b1374333	ring_buffer: add paranoid check for buffer page If for some strange reason the buffer_page gets bigger, or the page struct gets smaller, I want to know this ASAP. The best way is to not let the kernel compile. This patch adds code to test the size of the struct buffer_page against the page struct and will cause compile issues if the buffer_page ever gets bigger than the page struct. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:55 +02:00
Steven Rostedt	7a8e76a382	tracing: unified trace buffer This is a unified tracing buffer that implements a ring buffer that hopefully everyone will eventually be able to use. The events recorded into the buffer have the following structure: struct ring_buffer_event { u32 type:2, len:3, time_delta:27; u32 array[]; }; The minimum size of an event is 8 bytes. All events are 4 byte aligned inside the buffer. There are 4 types (all internal use for the ring buffer, only the data type is exported to the interface users). RINGBUF_TYPE_PADDING: this type is used to note extra space at the end of a buffer page. RINGBUF_TYPE_TIME_EXTENT: This type is used when the time between events is greater than the 27 bit delta can hold. We add another 32 bits, and record that in its own event (8 byte size). RINGBUF_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to help keep the buffer timestamps in sync. RINGBUF_TYPE_DATA: The event actually holds user data. The "len" field is only three bits. Since the data must be 4 byte aligned, this field is shifted left by 2, giving a max length of 28 bytes. If the data load is greater than 28 bytes, the first array field holds the full length of the data load and the len field is set to zero. Example, data size of 7 bytes: type = RINGBUF_TYPE_DATA len = 2 time_delta: <time-stamp> - <prev_event-time-stamp> array[0..1]: <7 bytes of data> <1 byte empty> This event is saved in 12 bytes of the buffer. An event with 82 bytes of data: type = RINGBUF_TYPE_DATA len = 0 time_delta: <time-stamp> - <prev_event-time-stamp> array[0]: 84 (Note the alignment) array[1..14]: <82 bytes of data> <2 bytes empty> The above event is saved in 92 bytes (if my math is correct). 82 bytes of data, 2 bytes empty, 4 byte header, 4 byte length. Do not reference the above event struct directly. Use the following functions to gain access to the event table, since the ring_buffer_event structure may change in the future. 
ring_buffer_event_length(event): get the length of the event. This is the size of the memory used to record this event, and not the size of the data payload. ring_buffer_time_delta(event): get the time delta of the event This returns the delta time stamp since the last event. Note: Even though this is in the header, there should be no reason to access this directly, except for debugging. ring_buffer_event_data(event): get the data from the event This is the function to use to get the actual data from the event. Note, it is only a pointer to the data inside the buffer. This data must be copied to another location otherwise you risk it being written over in the buffer. ring_buffer_lock: A way to lock the entire buffer. ring_buffer_unlock: unlock the buffer. ring_buffer_alloc: create a new ring buffer. Can choose between overwrite or consumer/producer mode. Overwrite will overwrite old data, whereas consumer/producer will throw away new data if the consumer catches up with the producer. The consumer/producer is the default. ring_buffer_free: free the ring buffer. ring_buffer_resize: resize the buffer. Changes the size of each cpu buffer. Note, it is up to the caller to ensure that the buffer is not being used while this is happening. This requirement may go away but do not count on it. ring_buffer_lock_reserve: locks the ring buffer and allocates an entry on the buffer to write to. ring_buffer_unlock_commit: unlocks the ring buffer and commits it to the buffer. ring_buffer_write: writes some data into the ring buffer. ring_buffer_peek: Look at the next item in the cpu buffer. ring_buffer_consume: get the next item in the cpu buffer and consume it. That is, this function increments the head pointer. ring_buffer_read_start: Start an iterator of a cpu buffer. For now, this disables the cpu buffer, until you issue a finish. This is just because we do not want the iterator to be overwritten. This restriction may change in the future. 
But note, this is used for static reading of a buffer which is usually done "after" a trace. Live readings would want to use the ring_buffer_consume above, which will not disable the ring buffer. ring_buffer_read_finish: Finishes the read iterator and reenables the ring buffer. ring_buffer_iter_peek: Look at the next item in the cpu iterator. ring_buffer_read: Read the iterator and increment it. ring_buffer_iter_reset: Reset the iterator to point to the beginning of the cpu buffer. ring_buffer_iter_empty: Returns true if the iterator is at the end of the cpu buffer. ring_buffer_size: returns the size in bytes of each cpu buffer. Note, the real size is this times the number of CPUs. ring_buffer_reset_cpu: Sets the cpu buffer to empty ring_buffer_reset: sets all cpu buffers to empty ring_buffer_swap_cpu: swaps a cpu buffer from one buffer with a cpu buffer of another buffer. This is handy when you want to take a snapshot of a running trace on just one cpu. Having a backup buffer to swap with facilitates this. Ftrace max latencies use this. ring_buffer_empty: Returns true if the ring buffer is empty. ring_buffer_empty_cpu: Returns true if the cpu buffer is empty. ring_buffer_record_disable: disable all cpu buffers (read only) ring_buffer_record_disable_cpu: disable a single cpu buffer (read only) ring_buffer_record_enable: enable all cpu buffers. ring_buffer_record_enable_cpu: enable a single cpu buffer. ring_buffer_entries: The number of entries in a ring buffer. ring_buffer_overruns: The number of entries removed due to writing wrap. ring_buffer_time_stamp: Get the time stamp used by the ring buffer ring_buffer_normalize_time_stamp: normalize the ring buffer time stamp into nanosecs. I still need to implement the GTOD feature. But we need support from the cpu frequency infrastructure. But this can be done at a later time without affecting the ring buffer interface. 
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:54 +02:00
Steven Rostedt	5aa60c6073	ftrace: give time for wakeup test to run It is possible that the testing thread in the ftrace wakeup test does not run before we stop the trace. This will cause the trace to fail since nothing will be in the buffers. This patch adds a small wait in the wakeup test to allow for the woken task to run and be traced. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:53 +02:00
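
The event sizing rules described in commit 7a8e76a382 (tracing: unified trace buffer) can be checked with a small user-space sketch. This is not kernel code: the constant and function names below are hypothetical, and the sketch only models the arithmetic given in the commit message (4-byte header, 4-byte alignment, a 3-bit len field shifted left by 2, and a separate length word when the padded payload exceeds 28 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; these are not kernel identifiers. */
#define RB_EVENT_HDR_SIZE 4u   /* one u32 holding type:2, len:3, time_delta:27 */
#define RB_ALIGN          4u   /* all events are 4-byte aligned */
#define RB_MAX_INLINE_LEN 28u  /* largest size the 3-bit len field encodes: 7 << 2 */

/* Round a payload size up to the next multiple of 4 bytes. */
static uint32_t rb_align_up(uint32_t n)
{
	return (n + RB_ALIGN - 1u) & ~(RB_ALIGN - 1u);
}

/*
 * Return the number of bytes an event with data_len bytes of payload
 * occupies in the buffer, and store the value the 3-bit len field
 * would carry in *len_field (0 means "length is stored in array[0]").
 * Assumes data_len > 0 and small enough that padding cannot overflow.
 */
static uint32_t rb_event_total_size(uint32_t data_len, uint32_t *len_field)
{
	uint32_t padded;

	assert(data_len > 0);
	padded = rb_align_up(data_len);

	if (padded <= RB_MAX_INLINE_LEN) {
		/* Small event: padded size fits in len after shifting right by 2. */
		*len_field = padded >> 2;
		return RB_EVENT_HDR_SIZE + padded;
	}

	/* Large event: len is zero and array[0] holds the padded length. */
	*len_field = 0;
	return RB_EVENT_HDR_SIZE + 4u + padded;
}
```

Under these assumptions, a 7-byte payload gives len = 2 and a 12-byte event, and an 82-byte payload gives len = 0 and a 92-byte event, which matches the two worked examples in the commit message.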


5171 Commits