Commit Graph

11441 Commits

Author SHA1 Message Date
Eric W. Biederman
6b49a25785 [PATCH] sysctl: fix sys_sysctl interface of ipc sysctls
Currently there is a regression and the ipc sysctls don't show up in the
binary sysctl namespace.

This patch adds sysctl_ipc_data to read data/write from the appropriate
namespace and deliver it in the expected manner.

[akpm@osdl.org: warning fix]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:03 -08:00
Eric W. Biederman
9bc9a6bd3c [PATCH] sysctl: simplify ipc ns specific sysctls
Refactor the ipc sysctl support so that it is simpler, more readable, and
prepares for fixing the bug with the wrong values being returned in the
sys_sysctl interface.

The function proc_do_ipc_string() was misnamed as it never handled strings.
It's magic of when to work with strings and when to work with longs belonged
in the sysctl table.  I couldn't tell if the code would work if you disabled
the ipc namespace but it certainly looked like it would have problems.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:03 -08:00
Eric W. Biederman
c4b8b769fa [PATCH] sysctl: implement sysctl_uts_string()
The problem: When using sys_sysctl we don't read the proper values for the
variables exported from the uts namespace, nor do we do the proper locking.

This patch introduces sysctl_uts_string which properly fetches the values and
does the proper locking.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:03 -08:00
Eric W. Biederman
cf9f151c72 [PATCH] sysctl: simplify sysctl_uts_string
The binary interface to the namespace sysctls was never implemented resulting
in some really weird things if you attempted to use sys_sysctl to read your
hostname for example.

This patch series simples the code a little and implements the binary sysctl
interface.

In testing this patch series I discovered that our 32bit compatibility for the
binary sysctl interface is imperfect.  In particular KERN_SHMMAX and
KERN_SMMALL are size_t sized quantities and are returned as 8 bytes on to
32bit binaries using a x86_64 kernel.  However this has existing for a long
time so it is not a new regression with the namespace work.

Gads the whole sysctl thing needs work before it stops being easy to shoot
yourself in the foot.

Looking forward a little bit we need a better way to handle sysctls and
namespaces as our current technique will not work for the network namespace.
I think something based on the current overlapping sysctl trees will work but
the proc side needs to be redone before we can use it.

This patch:

Introduce get_uts() and put_uts() (used later) and remove most of the special
cases for when UTS namespace is compiled in.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:03 -08:00
Oleg Nesterov
62dfb5541a [PATCH] session_of_pgrp: kill unnecessary do_each_task_pid(PIDTYPE_PGID)
All members of the process group have the same sid and it can't be == 0.

NOTE: this code (and a similar one in sys_setpgid) was needed because it
was possibe to have ->session == 0. It's not possible any longer since

	[PATCH] pidhash: don't use zero pids
	Commit: c7c6464117

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Oleg Nesterov
f020bc468f [PATCH] sys_setpgid: eliminate unnecessary do_each_task_pid(PIDTYPE_PGID)
All tasks in the process group have the same sid, we don't need to iterate
them all to check that the caller of sys_setpgid() doesn't change its
session.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Sukadev Bhattiprolu
84d737866e [PATCH] add child reaper to pid_namespace
Add a per pid_namespace child-reaper.  This is needed so processes are reaped
within the same pid space and do not spill over to the parent pid space.  Its
also needed so containers preserve existing semantic that pid == 1 would reap
orphaned children.

This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Cedric Le Goater
6cc1b22a4a [PATCH] use current->nsproxy->pid_ns
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Cedric Le Goater
9a575a92db [PATCH] to nsproxy
Add the pid namespace framework to the nsproxy object.  The copy of the pid
namespace only increases the refcount on the global pid namespace,
init_pid_ns, and unshare is not implemented.

There is no configuration option to activate or deactivate this feature
because this not relevant for the moment.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Sukadev Bhattiprolu
61a58c6c23 [PATCH] rename struct pspace to struct pid_namespace
Rename struct pspace to struct pid_namespace for consistency with other
namespaces (uts_namespace and ipc_namespace).  Also rename
include/linux/pspace.h to include/linux/pid_namespace.h and variables from
pspace to pid_ns.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Cedric Le Goater
373beb35cd [PATCH] identifier to nsproxy
Add an identifier to nsproxy.  The default init_ns_proxy has identifier 0 and
allocated nsproxies are given -1.

This identifier will be used by a new syscall sys_bind_ns.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Kirill Korotaev
6b3286ed11 [PATCH] rename struct namespace to struct mnt_namespace
Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with
other namespaces being developped for the containers : pid, uts, ipc, etc.
'namespace' variables and attributes are also renamed to 'mnt_ns'

Signed-off-by: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:51 -08:00
Cedric Le Goater
1ec320afdc [PATCH] add process_session() helper routine: deprecate old field
Add an anonymous union and ((deprecated)) to catch direct usage of the
session field.

[akpm@osdl.org: fix various missed conversions]
[jdike@addtoit.com: fix UML bug]
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:51 -08:00
Cedric Le Goater
937949d9ed [PATCH] add process_session() helper routine
Replace occurences of task->signal->session by a new process_session() helper
routine.

It will be useful for pid namespaces to abstract the session pid number.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:51 -08:00
Josef Sipek
a7a005fd12 [PATCH] struct path: convert kernel
Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:46 -08:00
Josef "Jeff" Sipek
f3a43f3f64 [PATCH] kernel: change uses of f_{dentry, vfsmnt} to use f_path
Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in
linux/kernel/.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:42 -08:00
NeilBrown
d63a5a74de [PATCH] lockdep: avoid lockdep warning in md
md_open takes ->reconfig_mutex which causes lockdep to complain.  This
(normally) doesn't have deadlock potential as the possible conflict is with a
reconfig_mutex in a different device.

I say "normally" because if a loop were created in the array->member hierarchy
a deadlock could happen.  However that causes bigger problems than a deadlock
and should be fixed independently.

So we flag the lock in md_open as a nested lock.  This requires defining
mutex_lock_interruptible_nested.

Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:39 -08:00
Oleg Nesterov
dae3c5a0b7 [PATCH] sys_unshare: remove a broken CLONE_SIGHAND code
sys_unshare(CLONE_SIGHAND) is broken, the code under 'if (new_sigh)' is
never executed but very wrong. Just remove it to avoid a confusion,
task_lock() has nothing to do with ->sighand changing.

Also, change the comment in unshare_sighand(). Yes, CLONE_THREAD implies
CLONE_SIGHAND, but still it looks confusing. Also, we don't need to check
current->sighand != NULL.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:38 -08:00
Oleg Nesterov
ae424ae4b5 [PATCH] make set_special_pids() static
Make set_special_pids() static, the only caller is daemonize().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:38 -08:00
Oleg Nesterov
7bcfa95e56 [PATCH] do_acct_process(): don't take tty_mutex
No need to take the global tty_mutex, signal->tty->driver can't go away while
we are holding ->siglock.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:38 -08:00
Peter Zijlstra
24ec839c43 [PATCH] tty: ->signal->tty locking
Fix the locking of signal->tty.

Use ->sighand->siglock to protect ->signal->tty; this lock is already used
by most other members of ->signal/->sighand.  And unless we are 'current'
or the tasklist_lock is held we need ->siglock to access ->signal anyway.

(NOTE: sys_unshare() is broken wrt ->sighand locking rules)

Note that tty_mutex is held over tty destruction, so while holding
tty_mutex any tty pointer remains valid.  Otherwise the lifetime of ttys
are governed by their open file handles.  This leaves some holes for tty
access from signal->tty (or any other non file related tty access).

It solves the tty SLAB scribbles we were seeing.

(NOTE: the change from group_send_sig_info to __group_send_sig_info needs to
       be examined by someone familiar with the security framework, I think
       it is safe given the SEND_SIG_PRIV from other __group_send_sig_info
       invocations)

[schwidefsky@de.ibm.com: 3270 fix]
[akpm@osdl.org: various post-viro fixes]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Alan Cox <alan@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:38 -08:00
Hidetoshi Seto
6e2ac66470 [PATCH] CPEI gets warning at kernel/irq/migration.c:27/move_masked_irq()
While running my MCA test (hardware error injection) on 2.6.19,
I got some warning like following:

> BUG: warning at kernel/irq/migration.c:27/move_masked_irq()
>
> Call Trace:
>  [<a000000100013d20>] show_stack+0x40/0xa0
>                                 sp=e00000006b2578d0 bsp=e00000006b2510b0
>  [<a000000100013db0>] dump_stack+0x30/0x60
>                                 sp=e00000006b257aa0 bsp=e00000006b251098
>  [<a0000001000de430>] move_masked_irq+0xb0/0x240
>                                 sp=e00000006b257aa0 bsp=e00000006b251070
>  [<a0000001000de6a0>] move_native_irq+0xe0/0x180
>                                 sp=e00000006b257aa0 bsp=e00000006b251040
>  [<a00000010004ff50>] iosapic_end_level_irq+0x30/0xe0
>                                 sp=e00000006b257aa0 bsp=e00000006b251020
>  [<a0000001000d94d0>] __do_IRQ+0x170/0x400
>                                 sp=e00000006b257aa0 bsp=e00000006b250fd8
>  [<a0000001000116f0>] ia64_handle_irq+0x1b0/0x260
>                                 sp=e00000006b257aa0 bsp=e00000006b250fa8
>  [<a00000010000c3a0>] ia64_leave_kernel+0x0/0x280
>                                 sp=e00000006b257aa0 bsp=e00000006b250fa8
>  [<a000000100690cf0>] _spin_unlock_irqrestore+0x30/0x60
>                                 sp=e00000006b257c70 bsp=e00000006b250f90

It comes from:

[kernel/irq/migration.c]
  26         if (CHECK_IRQ_PER_CPU(desc->status)) {
  27                 WARN_ON(1);
  28                 return;
  29         }

By putting some printk in kernel, I found that irqbalance is trying to
move CPEI which is handled as PER_CPU irq. That's why.

CPEI(Corrected Platform Error Interrupt) is ia64 specific irq, is
allowed to pin to particular processor which selected by the platform, and
even it is PER_CPU but it has set_affinity handler (=iosapic_set_affinity)
as same as other IO-SAPIC-level interrupts. (I don't know why, but
I guess that there would be typical situation where the handler for
migration is needed, such as hotplug - the processor going to be
offline/hot-removed.)

To shut up this warning, there are 2 way at least:
 a) fix CPEI stuff
 b) prohibit setting affinity to PER_CPU irq

I'm not sure what stuff of CPEI need to be fixed, but I think that
returning error to attempting move PER_CPU irq is useful for all
applications since it will never work.

Following small patch takes b) style.
It works, the warning disappeared and irqbalance still runs well.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:37 -08:00
Jan Beulich
aad094701c [PATCH] move kallsyms data to .rodata
Kallsyms data is never written to, so it can as well benefit from
CONFIG_DEBUG_RODATA.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:37 -08:00
Linus Torvalds
6ee7e78e7c Merge branch 'release' of master.kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of master.kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] replace kmalloc+memset with kzalloc
  [IA64] resolve name clash by renaming is_available_memory()
  [IA64] Need export for csum_ipv6_magic
  [IA64] Fix DISCONTIGMEM without VIRTUAL_MEM_MAP
  [PATCH] Add support for type argument in PAL_GET_PSTATE
  [IA64] tidy up return value of ip_fast_csum
  [IA64] implement csum_ipv6_magic for ia64.
  [IA64] More Itanium PAL spec updates
  [IA64] Update processor_info features
  [IA64] Add se bit to Processor State Parameter structure
  [IA64] Add dp bit to cache and bus check structs
  [IA64] SN: Correctly update smp_affinty mask
  [IA64] sparse cleanups
  [IA64] IA64 Kexec/kdump
2006-12-07 15:39:22 -08:00
Zou Nan hai
a79561134f [IA64] IA64 Kexec/kdump
Changes and updates.

1. Remove fake rendz path and related code according to discuss with Khalid Aziz.
2. fc.i offset fix in relocate_kernel.S.
3. iospic shutdown code eoi and mask race fix from Fujitsu.
4. Warm boot hook in machine_kexec to SN SAL code from Jack Steiner.
5. Send slave to SAL slave loop patch from Jay Lan.
6. Kdump on non-recoverable MCA event patch from Jay Lan
7. Use CTL_UNNUMBERED in kdump_on_init sysctl.

Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-12-07 09:51:35 -08:00
Linus Torvalds
68380b5813 Add "run_scheduled_work()" workqueue function
This allows workqueue users to run just their own pending work, rather
than wait for the whole workqueue to finish running.  This solves the
deadlock with networking libphy that was due to other workqueue entries
possibly needing a lock that was held by the routine that wanted to
flush its own work.

It's not wonderful: if you absolutely need to synchronize with the work
function having been executed, any user strictly speaking should have
its own completion tracking logic, since when we run things explicitly
by hand, the generic workqueue layer can no longer help us synchronize.

Also, this is strictly only usable for work that has been scheduled
without any delayed timers.  You can not mix the new interface with
schedule_delayed_work().

But it's better than what we had currently.

Acked-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 09:28:19 -08:00
Linus Torvalds
2685b267bc Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (48 commits)
  [NETFILTER]: Fix non-ANSI func. decl.
  [TG3]: Identify Serdes devices more clearly.
  [TG3]: Use msleep.
  [TG3]: Use netif_msg_*.
  [TG3]: Allow partial speed advertisement.
  [TG3]: Add TG3_FLG2_IS_NIC flag.
  [TG3]: Add 5787F device ID.
  [TG3]: Fix Phy loopback.
  [WANROUTER]: Kill kmalloc debugging code.
  [TCP] inet_twdr_hangman: Delete unnecessary memory barrier().
  [NET]: Memory barrier cleanups
  [IPSEC]: Fix inetpeer leak in ipv4 xfrm dst entries.
  audit: disable ipsec auditing when CONFIG_AUDITSYSCALL=n
  audit: Add auditing to ipsec
  [IRDA] irlan: Fix compile warning when CONFIG_PROC_FS=n
  [IrDA]: Incorrect TTP header reservation
  [IrDA]: PXA FIR code device model conversion
  [GENETLINK]: Fix misplaced command flags.
  [NETLIK]: Add a pointer to the Generic Netlink wiki page.
  [IPV6] RAW: Don't release unlocked sock.
  ...
2006-12-07 09:05:15 -08:00
Linus Torvalds
4522d58275 Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (156 commits)
  [PATCH] x86-64: Export smp_call_function_single
  [PATCH] i386: Clean up smp_tune_scheduling()
  [PATCH] unwinder: move .eh_frame to RODATA
  [PATCH] unwinder: fully support linker generated .eh_frame_hdr section
  [PATCH] x86-64: don't use set_irq_regs()
  [PATCH] x86-64: check vector in setup_ioapic_dest to verify if need setup_IO_APIC_irq
  [PATCH] x86-64: Make ix86 default to HIGHMEM4G instead of NOHIGHMEM
  [PATCH] i386: replace kmalloc+memset with kzalloc
  [PATCH] x86-64: remove remaining pc98 code
  [PATCH] x86-64: remove unused variable
  [PATCH] x86-64: Fix constraints in atomic_add_return()
  [PATCH] x86-64: fix asm constraints in i386 atomic_add_return
  [PATCH] x86-64: Correct documentation for bzImage protocol v2.05
  [PATCH] x86-64: replace kmalloc+memset with kzalloc in MTRR code
  [PATCH] x86-64: Fix numaq build error
  [PATCH] x86-64: include/asm-x86_64/cpufeature.h isn't a userspace header
  [PATCH] unwinder: Add debugging output to the Dwarf2 unwinder
  [PATCH] x86-64: Clarify error message in GART code
  [PATCH] x86-64: Fix interrupt race in idle callback (3rd try)
  [PATCH] x86-64: Remove unwind stack pointer alignment forcing again
  ...

Fixed conflict in include/linux/uaccess.h manually

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:59:11 -08:00
Paul Menage
d3ed11c356 [PATCH] cpuset: allow a larger buffer for writes to cpuset files
When using fake NUMA setup, the number of memory nodes can greatly exceed
the number of CPUs.  So the current limit in cpuset_common_file_write() is
insufficient.

Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:48 -08:00
Ingo Molnar
7929082250 [PATCH] add ignore_loglevel boot option
Sometimes the kernel prints something interesting while userspace bootup
keeps messages turned off via loglevel.  Enable the printing of /all/
kernel messages via the "ignore_loglevel" boot option.  Off by default.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:47 -08:00
Ingo Molnar
70e4506765 [PATCH] lockdep: register_lock_class() fix
The hash_lock must only ever be taken with irqs disabled.  This happens in
all the important places, except one codepath: register_lock_class().  The
race should trigger rarely because register_lock_class() is quite rare and
single-threaded (happens during init most of the time).

The fix is to disable irqs.

( bug found live in -rt: there preemption is alot more agressive and
  preempting with the hash-lock held caused a lockup.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:46 -08:00
Magnus Damm
85916f8166 [PATCH] Kexec / Kdump: Unify elf note code
The elf note saving code is currently duplicated over several
architectures.  This cleanup patch simply adds code to a common file and
then replaces the arch-specific code with calls to the newly added code.

The only drawback with this approach is that s390 doesn't fully support
kexec-on-panic which for that arch leads to introduction of unused code.

Signed-off-by: Magnus Damm <magnus@valinux.co.jp>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:46 -08:00
Helge Deller
15ad7cdcfd [PATCH] struct seq_operations and struct file_operations constification
- move some file_operations structs into the .rodata section

 - move static strings from policy_types[] array into the .rodata section

 - fix generic seq_operations usages, so that those structs may be defined
   as "const" as well

[akpm@osdl.org: couple of fixes]
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:46 -08:00
Ralf Baechle
ccdea2f88b [PATCH] futex: remove unneeded barrier
When disassembling a kernel I found around over 90 sync Instructions from
mb, rmb and wmb calls in the kernel and only few of those make any sense to
me.  So here's the first one - I think the wmb() in kernel/futex.c is not
needed on uniprocessors so should become an smb_wmb().

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:45 -08:00
Josh Triplett
012d3ca8d8 [PATCH] Add Sparse annotations to SRCU wrapper functions in rcutorture
The SRCU wrapper functions srcu_torture_read_lock and
srcu_torture_read_unlock in rcutorture intentionally change the SRCU
context; annotate them accordingly, to avoid a warning.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:44 -08:00
David Rientjes
a09c17a6fd [PATCH] sys: remove unused variable
Remove unused 'new_ruid' variable.

Reported by David Binderman <dcb314@hotmail.com>.

Signed-off-by: David Rientjes <rientjes@cs.washington.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:44 -08:00
Adrian Bunk
045f147f32 [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbols
In time for 2.6.20, we can get rid of this junk.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:44 -08:00
Zachary Amsden
3908fd2ed9 [PATCH] softirq: remove BUG_ONs which can incorrectly trigger
It is possible to have tasklets get scheduled before softirqd has had a chance
to spawn on all CPUs.  This is totally harmless; after success during action
CPU_UP_PREPARE, action CPU_ONLINE will be called, which immediately wakes
softirqd on the appropriate CPU to process the already pending tasklets.  So
there is no danger of having a missed wakeup for any tasklets that were
already pending.

In particular, i386 is affected by this during startup, and is visible when
using a very large initrd; during the time it takes for the initrd to be
decompressed, a timer IRQ can come in and schedule RCU callbacks.  It is also
possible that resending of a hardware IRQ via a softirq triggers the same bug.

Because of different timing conditions, this shows up in all emulators and
virtual machines tested, including Xen, VMware, Virtual PC, and Qemu.  It is
also possible to trigger on native hardware with a large enough initrd,
although I don't have a reliable case demonstrating that.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: <caglar@pardus.org.tr>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:43 -08:00
Ingo Molnar
2ee91f197c [PATCH] lockdep: show more details about self-test failures
Make the locking self-test failures (of 'FAILURE' type) easier to debug by
printing more information.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:43 -08:00
Ingo Molnar
50cc670aeb [PATCH] lockdep: more chains
Some have reported a chain-table overflow - double its size.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:43 -08:00
Chris Caputo
301827acbe [PATCH] sched: correct output of show_state()
At present show_state prints a header the does not match the output of
show_task, as follows:

-
                                               sibling
  task             PC      pid father child younger older
init          S 00000000     0     1      0     2               (NOTLB)
-

This patch corrects the output of show_state so that the header is
aligned with the data, ala:

-
                         free                        sibling
  task             PC    stack   pid father child younger older
init          S 00000000     0     1      0     2               (NOTLB)
-

Signed-off-by: Chris Caputo <ccaputo@alt.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:42 -08:00
BP, Praveen
bd9b0bac6f [PATCH] sysctl: string length calculated is wrong if it contains negative numbers
In the functions do_proc_dointvec() and do_proc_doulongvec_minmax(),
there seems to be a bug in string length calculation if string contains
negative integer.

The console log given below explains the bug. Setting negative values
may not be a right thing to do for "console log level" but then the test
(given below) can be used to demonstrate the bug in the code.

# echo "-1 -1 -1 -123456" > /proc/sys/kernel/printk
# cat /proc/sys/kernel/printk
-1      -1      -1      -1234
#
# echo "-1 -1 -1 123456" > /proc/sys/kernel/printk
# cat /proc/sys/kernel/printk
-1      -1      -1      1234
#

(akpm: the bug is that 123456 gets truncated)

It works as expected if string contains all +ve integers

# echo "1 2 3 4" > /proc/sys/kernel/printk
# cat /proc/sys/kernel/printk
1       2       3       4
#

The patch given below fixes the issue.

Signed-off-by: Praveen BP <praveenbp@ti.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:42 -08:00
Akinobu Mita
95362fa903 [PATCH] futex: init error check
Check register_filesystem() and kern_mount() return values.

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:41 -08:00
Burman Yan
4668edc334 [PATCH] kernel core: replace kmalloc+memset with kzalloc
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:41 -08:00
Eric Dumazet
1c69d921ed [PATCH] rcu: add a prefetch() in rcu_do_batch()
On some workloads, (for example when lot of close() syscalls are done), RCU
qlen can be quite large, and RCU heads are no longer in cpu cache when
rcu_do_batch() is called.

This patch adds a prefetch() in rcu_do_batch() to give CPU a hint to bring
back cache lines containing 'struct rcu_head's.

Most list manipulations macros include prefetch(), but not open coded ones
(at least with current C compilers :) )

I got a nice speedup on a trivial benchmark (3.48 us per iteration instead
of 3.95 us on a 1.6 GHz Pentium-M)

while (1) { pipe(p); close(fd[0]); close(fd[1]);}

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:40 -08:00
Adrian Bunk
d3228a887c [PATCH] make kernel/signal.c:kill_proc_info() static
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:39 -08:00
Adrian Bunk
ebe7e5fe4b [PATCH] remove kernel/lockdep.c:lockdep_internal
Remove the no longer used lockdep_internal().

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:39 -08:00
Ingo Molnar
0231606785 [PATCH] hotplug CPU: clean up hotcpu_notifier() use
There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
generating compiler warnings of unused symbols, hence forcing people to add
#ifdefs.

the compiler can skip truly unused functions just fine:

    text    data     bss     dec     hex filename
 1624412  728710 3674856 6027978  5bfaca vmlinux.before
 1624412  728710 3674856 6027978  5bfaca vmlinux.after

[akpm@osdl.org: topology.c fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:39 -08:00
Masami Hiramatsu
b4c6c34a53 [PATCH] kprobes: enable booster on the preemptible kernel
When we are unregistering a kprobe-booster, we can't release its
instruction buffer immediately on the preemptive kernel, because some
processes might be preempted on the buffer.  The freeze_processes() and
thaw_processes() functions can clean most of processes up from the buffer.
There are still some non-frozen threads who have the PF_NOFREEZE flag.  If
those threads are sleeping (not preempted) at the known place outside the
buffer, we can ensure safety of freeing.

However, the processing of this check routine takes a long time.  So, this
patch introduces the garbage collection mechanism of insn_slot.  It also
introduces the "dirty" flag to free_insn_slot because of efficiency.

The "clean" instruction slots (dirty flag is cleared) are released
immediately.  But the "dirty" slots which are used by boosted kprobes, are
marked as garbages.  collect_garbage_slots() will be invoked to release
"dirty" slots if there are more than INSNS_PER_PAGE garbage slots or if
there are no unused slots.

Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "bibo,mao" <bibo.mao@intel.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Yumiko Sugita <yumiko.sugita.yf@hitachi.com>
Cc: Satoshi Oshima <soshima@redhat.com>
Cc: Hideo Aoki <haoki@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:38 -08:00
Mike Galbraith
c36264dfb2 [PATCH] remove the syslog interface when printk is disabled
Attempts to read() from the non-existent dmesg buffer will return zero and
userspace tends to get stuck in a busyloop.

So just remove /dev/kmsg altogether if CONFIG_PRINTK=n.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:38 -08:00
Alan Cox
40fcfc8722 [PATCH] HZ: 300Hz support
Fix two things.  Firstly the unit is "Hz" not "HZ".  Secondly it is useful
to have 300Hz support when doing multimedia work.  250 is fine for us in
Europe but the US frame rate is 30fps (29.99 blah for pedants).  300 gives
us a tick divisible by both 25 and 30, and for interlace work 50 and 60.
It's also giving similar performance to 250Hz.

I'd argue we should remove 250 and add 300, but that might be excess
disruption for now.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:36 -08:00
Peter Zijlstra
d5abe66917 [PATCH] debug: workqueue locking sanity
Workqueue functions should not leak locks, assert so, printing the
last function ran.

Use macros in lockdep.h to avoid include dependency pains.

[akpm@osdl.org: build fix]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:36 -08:00
Ingo Molnar
ece8a684c7 [PATCH] sleep profiling
Implement prof=sleep profiling.  TASK_UNINTERRUPTIBLE sleeps will be taken
as a profile hit, and every millisecond spent sleeping causes a profile-hit
for the call site that initiated the sleep.

Sample readprofile output on i386:

   306 ps2_sendbyte                               1.3973
   432 call_usermodehelper_keys                   1.9548
   484 ps2_command                                0.6453
   790 __driver_attach                            4.7879
  1593 msleep                                    44.2500
  3976 sync_buffer                               64.1290
  4076 do_lookup                                 12.4648
  8587 sync_page                                122.6714
 20820 total                                      0.0067

(NOTE: architectures need to check whether get_wchan() can be called from
deep within the wakeup path.)

akpm: we need to mark more functions __sched.  lock_sock(), msleep(), others..

akpm: the contention in do_lookup() is a surprise.  Presumably doing disk
reads for directory contents while holding i_mutex.

[akpm@osdl.org: various fixes]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:36 -08:00
Peter Zijlstra
6cfd76a26d [PATCH] lockdep: name some old style locks
Name some of the remaning 'old_style_spin_init' locks

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:36 -08:00
Peter Zijlstra
a4c410f00f [PATCH] lockdep: print current locks on in_atomic warnings
Add debug_show_held_locks(current) to __might_sleep() and schedule(); this
makes finding the offending lock leak easier.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:36 -08:00
Oleg Nesterov
3716748530 [PATCH] taskstats: cleanup reply assembling
Thomas Graf wrote:
>
> nla_nest_start() may return NULL, either rely on prepare_reply() to be
> correct and BUG() on failure or do proper error handling for all
> functions.

nla_put() in taskstat.c can fail only if the 'size' argument of alloc_skb()
was not right. This is a kernel bug, we should not hide it. So add 'BUG()'
on error path and check for 'na == NULL'.

> genlmsg_cancel() is only required in error paths for dumping
> procedures.

So we can remove 'genlmsg_cancel()' calls and 'void *reply' (saves 227 bytes).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Oleg Nesterov
51de4d9085 [PATCH] taskstats: use nla_reserve() for reply assembling
Currently taskstats_user_cmd()/taskstats_exit() do:

	1) allocate stats
	2) fill stats
	3) make a temporary copy on stack (236 bytes)
	4) copy that copy to skb
	5) free stats

With the help of nla_reserve() we can operate on skb->data directly,
thus avoiding all these steps except 2).

So, before this patch:

	// copy *stats to skb->data
	int mk_reply(skb, ..., struct taskstats *stats);

	fill_pid(stats);
	mk_reply(skb, ..., stats);

After:
	// return a pointer to skb->data
	struct taskstats *mk_reply(skb, ...);

	stat = mk_reply(skb, ...);
	fill_pid(stats);

Shrinks taskatsks.o by 162 bytes.

A stupid benchmark (send one million TASKSTATS_CMD_ATTR_PID) shows the

		real user sys
	before:
		4.02 0.06 3.96
		4.02 0.04 3.98
		4.02 0.04 3.97
	after:
		3.86 0.08 3.78
		3.88 0.10 3.77
		3.89 0.09 3.80

but this looks suspiciously good.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Oleg Nesterov
68062b86fc [PATCH] taskstats: factor out reply assembling
Introduce mk_reply() helper which does all nla_put()s on reply.

Saves 453 bytes and a preparation for the next patch.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Oleg Nesterov
34ec12349c [PATCH] taskstats: cleanup ->signal->stats allocation
Allocate ->signal->stats on demand in taskstats_exit(), this allows us to
remove taskstats_tgid_alloc() (the last non-trivial inline) from taskstat's
public interface.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Oleg Nesterov
115085ea07 [PATCH] taskstats: cleanup do_exit() path
do_exit:
	taskstats_exit_alloc()
	...
	taskstats_exit_send()
	taskstats_exit_free()

I think this is not good, let it be a single function exported to the core
kernel, taskstats_exit(), which does alloc + send + free itself.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Oleg Nesterov
128fb95650 [PATCH] taskstats_exit_alloc: optimize/simplify
If there are no listeners, every task does unneeded kmem_cache alloc/free on
exit. We don't need listeners->sem for 'if (!list_empty())' check. Yes, we may
have a false positive, but this doesn't differ from the case when the listener
is unregistered after we drop the semaphore. So we don't need to do allocation
beforehand.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Heiko Carstens
064b022c7a [PATCH] profile: fix uaccess handling
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:33 -08:00
Roland McGrath
fec1d01152 [PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit
The CLONE_CHILD_CLEARTID flag is used by NPTL to have its threads
communicate via memory/futex when they exit, so pthread_join can
synchronize using a simple futex wait.  The word of user memory where NPTL
stores a thread's own TID is what it passes; this gets reset to zero at
thread exit.

It is not desireable to touch this user memory when threads are dying due
to a fatal signal.  A core dump is more usefully representative of the
dying program state if the threads live at the time of the crash have their
NPTL data structures unperturbed.  The userland expectation of
CLONE_CHILD_CLEARTID has only ever been that it works for a thread making
an _exit system call.

This problem was identified by Ernie Petrides <petrides@redhat.com>.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ernie Petrides <petrides@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:33 -08:00
Jarek Poplawski
b23984d0a1 [PATCH] lockdep: misc fixes in lockdep.c
- numeric string size replaced with constant in print_lock_name and
   print_lockdep_cache,

 - return on null pointer in print_lock_dependencies,

 - one more lockdep return with 0 with unlocking fix in mark_lock.

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:33 -08:00
Jarek Poplawski
910b1b2e6d [PATCH] lockdep: internal locking fixes
Here are mainly some lockdep returns with 0 with unlocking fixes.

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:33 -08:00
Paul Jackson
696040670a [PATCH] cpuset: minor code refinements
A couple of minor code simplifications to the kernel/cpuset.c code.  No
functional change.  Just a little less code and a little more readable.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:32 -08:00
Ralf Baechle
4cf303487d [PATCH] Export pm_suspend for the shared APM emulation
The new shared APM emulation just like its ARM and MIPS predecessors uses
pm_suspend() which was only exported on SH.  Move export to close to it's
definition where it really should be anyway.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:32 -08:00
Ingo Molnar
e59e2ae2c2 [PATCH] SysRq-X: show blocked tasks
Add SysRq-X support: show blocked (TASK_UNINTERRUPTIBLE) tasks only.

Useful for debugging IO stalls.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:32 -08:00
Adam B. Jerome
07354a0090 [PATCH] /proc/kallsyms reports lower-case types for some non-exported symbols
This patch addresses incorrect symbol type information reported through
/proc/kallsyms.  A lowercase character should designate the symbol as local
(or non-exported).  An uppercase character should designate the symbol as
global (or external).

Without this patch, some non-exported symbols are incorrectly assigned an
upper-case designation in /proc/kallsyms.  This patch corrects this
condition by converting non-exported symbols types to lower case when
appropriate and eliminates the superfluous upcase_if_global function

Signed-off-by: Adam B. Jerome <abj@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:31 -08:00
Peter Zijlstra
ed07536ed6 [PATCH] lockdep: annotate nfs/nfsd in-kernel sockets
Stick NFS sockets in their own class to avoid some lockdep warnings.  NFS
sockets are never exposed to user-space, and will hence not trigger certain
code paths that would otherwise pose deadlock scenarios.

[akpm@osdl.org: cleanups]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Dickson <SteveD@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Acked-by: Neil Brown <neilb@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
[ Fixed patch corruption by quilt, pointed out by Peter Zijlstra ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:30 -08:00
Rafael J. Wysocki
341a595850 [PATCH] Support for freezeable workqueues
Make it possible to create a workqueue the worker thread of which will be
frozen during suspend, along with other kernel threads.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:29 -08:00
Pavel Machek
5045cfc103 [PATCH] swsusp: kill write-only variable
Cleanup write-only variable, suggested by D Binderman.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:29 -08:00
Rafael J. Wysocki
2d87595ea6 [PATCH] PM: Fix swsusp debug mode testproc
The 'testproc' swsusp debug mode thaws tasks twice in a row, which is _very_
confusing.  Fix that.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
59a493350e [PATCH] swsusp: Fix labels
Move all labels in the swsusp code to the second column, so that they won't
fool diff -p.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
5b6d15de2d [PATCH] swsusp: Fix coding style in suspend.c
Fix coding style in suspend.c.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
11b2ce2ba9 [PATCH] swsusp: Untangle freeze_processes
Move the loop from freeze_processes() to a separate function and call it
independently for user space processes and kernel threads so that the order
of freezing tasks is clearly visible.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
a9b6f562f1 [PATCH] swsusp: Untangle thaw_processes
Move the loop from thaw_processes() to a separate function and call it
independently for kernel threads and user space processes so that the order
of thawing tasks is clearly visible.

Drop thaw_kernel_threads() which is never used.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Stephen Hemminger
a6d7098060 [PATCH] convert pm_sem to a mutex
The power management semaphore is only used as mutex, so convert it.

[akpm@osdl.org: fix rotten bug]
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
3eb1b3a407 [PATCH] suspend to disk fails if gdb is suspended with a traced child
Fix http://bugzilla.kernel.org/show_bug.cgi?id=7534

Fix the freezing of processes so that it won't fail if there is a traced
process the parent of which has been stopped.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: maurice barnum <pixi+kbug@burble.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
0d3a9abe8a [PATCH] swsusp: Measure memory shrinking time
Make swsusp measure and print the time needed to shrink memory during the
suspend.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Siddha, Suresh B
112cecb2cc [PATCH] suspend: don't change cpus_allowed for task initiating the suspend
Don't modify the cpus_allowed of the task initiating the suspend.
_cpu_down() already makes sure that the task doing the suspend doesn't run
on dying cpu.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
2d4a34c936 [PATCH] swsusp: Support i386 systems with PAE or without PSE
Make swsusp support i386 systems with PAE or without PSE.

This is done by creating temporary page tables located in resume-safe page
frames before the suspend image is restored in the same way as x86_64 does
it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Nigel Cunningham <ncunningham@linuxmail.org>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Nigel Cunningham
ff39593ad0 [PATCH] swsusp: thaw userspace and kernel space separately
Modify process thawing so that we can thaw kernel space without thawing
userspace, and thaw kernelspace first.  This will be useful in later
patches, where I intend to get swsusp thawing kernel threads only before
seeking to free memory.

Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Nigel Cunningham
14b5b7cfaa [PATCH] swsusp: clean up whitespace in freezer output
Minor whitespace and formatting modifications for the freezer.

Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Nigel Cunningham
32d50f57da [PATCH] swsusp: quieten Freezer if !CONFIG_PM_DEBUG
The freezer currently prints an '=' for every process that is frozen.  This
is pretty pointless, as the equals sign says nothing about which process is
frozen, and makes logs look messier (especially if there were a large
number of processes running).  All we really need to know is that we
started trying to freeze processes and what processes (if any) failed to
freeze, or that we succeeded.

Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Nigel Cunningham
7dfb71030f [PATCH] Add include/linux/freezer.h and move definitions from sched.h
Move process freezing functions from include/linux/sched.h to freezer.h, so
that modifications to the freezer or the kernel configuration don't require
recompiling just about everything.

[akpm@osdl.org: fix ueagle driver]
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Stefan Seyfried
8a05aac263 [PATCH] swsusp: fix platform mode
At some point after 2.6.13, in-kernel software suspend got "incomplete" for
the so-called "platform" mode.  pm_ops->prepare() is never called.  A
visible sign of this is the "moon" light on thinkpads not flashing during
suspend.  Fix by readding the pm_ops->prepare call during suspend.

Signed-off-by: Stefan Seyfried <seife@suse.de>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
8594912187 [PATCH] swsusp: use __GFP_WAIT
swsusp uses GFP_ATOMIC, but it can afford to use __GFP_WAIT, which will
permit it to reclaim clean pagecache instead of emitting scary
page-allocation-failure messages.

Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
8357376d3d [PATCH] swsusp: Improve handling of highmem
Currently swsusp saves the contents of highmem pages by copying them to the
normal zone which is quite inefficient (eg.  it requires two normal pages
to be used for saving one highmem page).  This may be improved by using
highmem for saving the contents of saveable highmem pages.

Namely, during the suspend phase of the suspend-resume cycle we try to
allocate as many free highmem pages as there are saveable highmem pages.
If there are not enough highmem image pages to store the contents of all of
the saveable highmem pages, some of them will be stored in the "normal"
memory.  Next, we allocate as many free "normal" pages as needed to store
the (remaining) image data.  We use a memory bitmap to mark the allocated
free pages (ie.  highmem as well as "normal" image pages).

Now, we use another memory bitmap to mark all of the saveable pages
(highmem as well as "normal") and the contents of the saveable pages are
copied into the image pages.  Then, the second bitmap is used to save the
pfns corresponding to the saveable pages and the first one is used to save
their data.

During the resume phase the pfns of the pages that were saveable during the
suspend are loaded from the image and used to mark the "unsafe" page
frames.  Next, we try to allocate as many free highmem page frames as to
load all of the image data that had been in the highmem before the suspend
and we allocate so many free "normal" page frames that the total number of
allocated free pages (highmem and "normal") is equal to the size of the
image.  While doing this we have to make sure that there will be some extra
free "normal" and "safe" page frames for two lists of PBEs constructed
later.

Now, the image data are loaded, if possible, into their "original" page
frames.  The image data that cannot be written into their "original" page
frames are loaded into "safe" page frames and their "original" kernel
virtual addresses, as well as the addresses of the "safe" pages containing
their copies, are stored in one of two lists of PBEs.

One list of PBEs is for the copies of "normal" suspend pages (ie.  "normal"
pages that were saveable during the suspend) and it is used in the same way
as previously (ie.  by the architecture-dependent parts of swsusp).  The
other list of PBEs is for the copies of highmem suspend pages.  The pages
in this list are restored (in a reversible way) right before the
arch-dependent code is called.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
37b2ba12df [PATCH] swsusp: add ioctl for swap files support
To be able to use swap files as suspend storage from the userland suspend
tools we need an additional ioctl() that will allow us to provide the kernel
with both the swap header's offset and the identification of the resume
partition.

The new ioctl() should be regarded as a replacement for the
SNAPSHOT_SET_SWAP_FILE ioctl() that from now on will be considered as
obsolete, but has to stay for backwards compatibility of the interface.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
9a154d9d95 [PATCH] swsusp: add resume_offset command line parameter
Add the kernel command line parameter "resume_offset=" allowing us to specify
the offset, in <PAGE_SIZE> units, from the beginning of the partition pointed
to by the "resume=" parameter at which the swap header is located.

This offset can be determined, for example, by an application using the FIBMAP
ioctl to obtain the swap header's block number for given file.

[akpm@osdl.org: we don't know what type sector_t is]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
3aef83e0ef [PATCH] swsusp: use block device offsets to identify swap locations
Make swsusp use block device offsets instead of swap offsets to identify swap
locations and make it use the same code paths for writing as well as for
reading data.

This allows us to use the same code for handling swap files and swap
partitions and to simplify the code, eg.  by dropping rw_swap_page_sync().

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
3fc6b34f48 [PATCH] swsusp: rearrange swap-handling code
Rearrange the code in kernel/power/swap.c so that the next patch is more
readable.

[This patch only moves the existing code.]

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
915bae9ebe [PATCH] swsusp: use partition device and offset to identify swap areas
The Linux kernel handles swap files almost in the same way as it handles swap
partitions and there are only two differences between these two types of swap
areas:

(1) swap files need not be contiguous,

(2) the header of a swap file is not in the first block of the partition
    that holds it.  From the swsusp's point of view (1) is not a problem,
    because it is already taken care of by the swap-handling code, but (2) has
    to be taken into consideration.

In principle the location of a swap file's header may be determined with the
help of appropriate filesystem driver.  Unfortunately, however, it requires
the filesystem holding the swap file to be mounted, and if this filesystem is
journaled, it cannot be mounted during a resume from disk.  For this reason we
need some other means by which swap areas can be identified.

For example, to identify a swap area we can use the partition that holds the
area and the offset from the beginning of this partition at which the swap
header is located.

The following patch allows swsusp to identify swap areas this way.  It changes
swap_type_of() so that it takes an additional argument representing an offset
of the swap header within the partition represented by its first argument.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Stefan Seyfried
3592695c36 [PATCH] uswsusp: add pmops->{prepare,enter,finish} support (aka "platform mode")
Add an ioctl to the userspace swsusp code that enables the usage of the
pmops->prepare, pmops->enter and pmops->finish methods (the in-kernel
suspend knows these as "platform method").  These are needed on many
machines to (among others) speed up resuming by letting the BIOS skip some
steps or let my hp nx5000 recognise the correct ac_adapter state after
resume again.

It also ensures on many machines, that changed hardware (unplugged AC
adapters) gets correctly detected and that kacpid does not run wild after
resume.

Signed-off-by: Stefan Seyfried <seife@suse.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:26 -08:00
Christoph Lameter
e18b890bb0 [PATCH] slab: remove kmem_cache_t
Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

	#!/bin/sh
	#
	# Replace one string by another in all the kernel sources.
	#

	set -e

	for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
		quilt add $file
		sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
		mv /tmp/$$ $file
		quilt refresh
	done

The script was run like this

	sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Lameter
e94b176609 [PATCH] slab: remove SLAB_KERNEL
SLAB_KERNEL is an alias of GFP_KERNEL.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:24 -08:00
Peter Zijlstra
a866374aec [PATCH] mm: pagefault_{disable,enable}()
Introduce pagefault_{disable,enable}() and use these where previously we did
manual preempt increments/decrements to make the pagefault handler do the
atomic thing.

Currently they still rely on the increased preempt count, but do not rely on
the disabled preemption, this might go away in the future.

(NOTE: the extra barrier() in pagefault_disable might fix some holes on
       machines which have too many registers for their own good)

[heiko.carstens@de.ibm.com: s390 fix]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Ashwin Chaugule
7602bdf2fd [PATCH] new scheme to preempt swap token
The new swap token patches replace the current token traversal algo.  The old
algo had a crude timeout parameter that was used to handover the token from
one task to another.  This algo, transfers the token to the tasks that are in
need of the token.  The urgency for the token is based on the number of times
a task is required to swap-in pages.  Accordingly, the priority of a task is
incremented if it has been badly affected due to swap-outs.  To ensure that
the token doesnt bounce around rapidly, the token holders are given a priority
boost.  The priority of tasks is also decremented, if their rate of swap-in's
keeps reducing.  This way, the condition to check whether to pre-empt the swap
token, is a matter of comparing two task's priority fields.

[akpm@osdl.org: cleanups]
Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Joy Latten
161a09e737 audit: Add auditing to ipsec
An audit message occurs when an ipsec SA
or ipsec policy is created/deleted.

Signed-off-by: Joy Latten <latten@austin.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-06 20:14:22 -08:00
Jan Beulich
b65780e123 [PATCH] unwinder: move .eh_frame to RODATA
The .eh_frame section contents is never written to, so it can as well
benefit from CONFIG_DEBUG_RODATA.

Diff-ed against firstfloor tree.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:19 +01:00
Jan Beulich
c65f38d911 [PATCH] unwinder: fully support linker generated .eh_frame_hdr section
Now that binutils' ld is able to properly populate .eh_frame_hdr in the
Linux kernel case, here's a patch to add some functionality to the Dwarf2
unwinder to actually be able to make use of this (applies on firstfloor
tree with the previously sent patch to add debug output, but not on plain
2.6.19).

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:19 +01:00
Jan Beulich
6d0185ea61 [PATCH] unwinder: Add debugging output to the Dwarf2 unwinder
Add debugging printks to the unwinder to allow easier debugging
when something goes wrong with it.

This can be controlled with the new unwinder_debug=N option
Most output is given by N=1

AK: Added documentation of unwinder_debug=

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:13 +01:00
Jan Beulich
359ad0d401 [PATCH] unwinder: more sanity checks in Dwarf2 unwinder
Tighten the requirements on both input to and output from the Dwarf2
unwinder.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:13 +01:00
Andi Kleen
eef5e0d185 [PATCH] unwinder: Remove lockdep disabling of nested locks for unwinder
Shouldn't be needed anymore since __kernel_text_address
is used unconditionally on x86-64

Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:12 +01:00
Andi Kleen
e2124bb8d3 [PATCH] unwinder: Use probe_kernel_address instead of __get_user in kernel/unwind.c
This avoids trouble with the page fault handler if the fault
happens inside an interrupt context.

Suggested by Linus

Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:12 +01:00
Chuck Ebbert
0741f4d207 [PATCH] x86: add sysctl for kstack_depth_to_print
Add sysctl for kstack_depth_to_print. This lets users change
the amount of raw stack data printed in dump_stack() without
having to reboot.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:11 +01:00
Jeremy Fitzhardinge
f95d47caae [PATCH] i386: Use %gs as the PDA base-segment in the kernel
This patch is the meat of the PDA change.  This patch makes several related
changes:

1: Most significantly, %gs is now used in the kernel.  This means that on
   entry, the old value of %gs is saved away, and it is reloaded with
   __KERNEL_PDA.

2: entry.S constructs the stack in the shape of struct pt_regs, and this
   is passed around the kernel so that the process's saved register
   state can be accessed.

   Unfortunately struct pt_regs doesn't currently have space for %gs
   (or %fs). This patch extends pt_regs to add space for gs (no space
   is allocated for %fs, since it won't be used, and it would just
   complicate the code in entry.S to work around the space).

3: Because %gs is now saved on the stack like %ds, %es and the integer
   registers, there are a number of places where it no longer needs to
   be handled specially; namely context switch, and saving/restoring the
   register state in a signal context.

4: And since kernel threads run in kernel space and call normal kernel
   code, they need to be created with their %gs == __KERNEL_PDA.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
2006-12-07 02:14:02 +01:00
David Howells
9db7372445 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/ata/libata-scsi.c
	include/linux/libata.h

Futher merge of Linus's head and compilation fixups.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-12-05 17:01:28 +00:00
David Howells
4c1ac1b491 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/infiniband/core/iwcm.c
	drivers/net/chelsio/cxgb2.c
	drivers/net/wireless/bcm43xx/bcm43xx_main.c
	drivers/net/wireless/prism54/islpci_eth.c
	drivers/usb/core/hub.h
	drivers/usb/input/hid-core.c
	net/core/netpoll.c

Fix up merge failures with Linus's head and fix new compilation failures.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-12-05 14:37:56 +00:00
Al Viro
a1f8e7f7fb [PATCH] severing skbuff.h -> highmem.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-12-04 02:00:29 -05:00
Al Viro
f6a570333e [PATCH] severing module.h->sched.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-12-04 02:00:22 -05:00
Thomas Graf
17c157c889 [GENL]: Add genlmsg_put_reply() to simplify building reply headers
By modyfing genlmsg_put() to take a genl_family and by adding
genlmsg_put_reply() the process of constructing the netlink
and generic netlink headers is simplified.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:42 -08:00
Thomas Graf
3dabc71578 [GENL]: Add genlmsg_new() to allocate generic netlink messages
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:40 -08:00
Thomas Graf
339bf98ffc [NETLINK]: Do precise netlink message allocations where possible
Account for the netlink message header size directly in nlmsg_new()
instead of relying on the caller calculate it correctly.

Replaces error handling of message construction functions when
constructing notifications with bug traps since a failure implies
a bug in calculating the size of the skb.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:11 -08:00
Kay Sievers
e17e0f51ae Driver core: show drivers in /sys/module/
Show the drivers, which belong to the module:
  $ ls -l /sys/module/usbcore/drivers/
  hub -> ../../../bus/usb/drivers/hub
  usb -> ../../../bus/usb/drivers/usb
  usbfs -> ../../../bus/usb/drivers/usbfs

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-12-01 14:52:02 -08:00
Linus Torvalds
707badb80b Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6:
  [PATCH] x86-64: Use stricter in process stack check for unwinder
  [PATCH] i386: Fix compilation with UP genericarch
  [PATCH] x86-64: Fix warning in io_apic.c
  [PATCH] x86-64: work around gcc4 issue with -Os in Dwarf2 stack unwind
  [PATCH] x86_64: Align data segment to PAGE_SIZE boundary
2006-11-28 17:28:41 -08:00
Akinobu Mita
3cce4856ff [PATCH] fix create_write_pipe() error check
The return value of create_write_pipe()/create_read_pipe() should be
checked by IS_ERR().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-28 17:26:50 -08:00
Jan Beulich
ff0a538d8b [PATCH] x86-64: work around gcc4 issue with -Os in Dwarf2 stack unwind
This fixes a problem with gcc4 mis-compiling the stack unwind code under
-Os, which resulted in 'stuck' messages whenever an assembly routine was
encountered.

(The second hunk is trivial cleanup.)

Signed-off-by: Jan Beulich <jbeulich@novell.com>
2006-11-28 20:12:59 +01:00
Arjan van de Ven
cfd3ef2346 [PATCH] lockdep: spin_lock_irqsave_nested()
Introduce spin_lock_irqsave_nested(); implementation from:
 http://lkml.org/lkml/2006/6/1/122
Patch from:
 http://lkml.org/lkml/2006/9/13/258

[akpm@osdl.org: two compile fixes]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Jiri Kosina <jikos@jikos.cz>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-25 13:28:34 -08:00
Akinobu Mita
753ca4f312 [PATCH] fix copy_process() error check
The return value of copy_process() should be checked by IS_ERR().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-25 13:28:34 -08:00
Linus Torvalds
b42172fc7b Don't call "note_interrupt()" with irq descriptor lock held
This reverts commit f72fa70760, and solves
the problem that it tried to fix by simply making "__do_IRQ()" call the
note_interrupt() function without the lock held, the way everybody else
does.

It should be noted that all interrupt handling code must never allow the
descriptor actors to be entered "recursively" (that's why we do all the
magic IRQ_PENDING stuff in the first place), so there actually is
exclusion at that much higher level, even in the absense of locking.

Acked-by: Vivek Goyal <vgoyal@in.ibm.com>
Acked-by:Pavel Emelianov <xemul@openvz.org>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-22 09:32:06 -08:00
David Howells
c4028958b6 WorkStruct: make allyesconfig
Fix up for make allyesconfig.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:57:56 +00:00
David Howells
65f27f3844 WorkStruct: Pass the work_struct pointer instead of context data
Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.

For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.

To make this work, an extra flag is introduced into the management side of the
work_struct.  This governs auto-release of the structure upon execution.

Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function.  This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated..  This is a
problem if the auxiliary data is taken away (as done by the last patch).

However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems.  But then the work function must itself release the
work_struct by calling work_release().

In most cases, automatic release is fine, so this is the default.  Special
initiators exist for the non-auto-release case (ending in _NAR).


Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:55:48 +00:00
David Howells
365970a1ea WorkStruct: Merge the pending bit into the wq_data pointer
Reclaim a word from the size of the work_struct by folding the pending bit and
the wq_data pointer together.  This shouldn't cause misalignment problems as
all pointers should be at least 4-byte aligned.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:54:49 +00:00
David Howells
6bb49e5965 WorkStruct: Typedef the work function prototype
Define a type for the work function prototype.  It's not only kept in the
work_struct struct, it's also passed as an argument to several functions.

This makes it easier to change it.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:54:45 +00:00
David Howells
52bad64d95 WorkStruct: Separate delayable and non-delayable events.
Separate delayable work items from non-delayable work items be splitting them
into a separate structure (delayed_work), which incorporates a work_struct and
the timer_list removed from work_struct.

The work_struct struct is huge, and this limits it's usefulness.  On a 64-bit
architecture it's nearly 100 bytes in size.  This reduces that by half for the
non-delayable type of event.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:54:01 +00:00
Ingo Molnar
1ff5683043 [PATCH] lockdep: fix static keys in module-allocated percpu areas
lockdep got confused by certain locks in modules:

 INFO: trying to register non-static key.
 the code is fine but needs lockdep annotation.
 turning off the locking correctness validator.

 Call Trace:
  [<ffffffff8026f40d>] dump_trace+0xaa/0x3f2
  [<ffffffff8026f78f>] show_trace+0x3a/0x60
  [<ffffffff8026f9d1>] dump_stack+0x15/0x17
  [<ffffffff802abfe8>] __lock_acquire+0x724/0x9bb
  [<ffffffff802ac52b>] lock_acquire+0x4d/0x67
  [<ffffffff80267139>] rt_spin_lock+0x3d/0x41
  [<ffffffff8839ed3f>] :ip_conntrack:__ip_ct_refresh_acct+0x131/0x174
  [<ffffffff883a1334>] :ip_conntrack:udp_packet+0xbf/0xcf
  [<ffffffff8839f9af>] :ip_conntrack:ip_conntrack_in+0x394/0x4a7
  [<ffffffff8023551f>] nf_iterate+0x41/0x7f
  [<ffffffff8025946a>] nf_hook_slow+0x64/0xd5
  [<ffffffff802369a2>] ip_rcv+0x24e/0x506
  [...]

Steven Rostedt found the bug: static_obj() check did not take
PERCPU_ENOUGH_ROOM into account, so in-module DEFINE_PER_CPU-area locks
were triggering this message.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-17 11:10:37 -08:00
Zhang, Yanmin
b86432b42e [PATCH] some irq_chip variables point to NULL
I got an oops when booting 2.6.19-rc5-mm1 on my ia64 machine.

Below is the log.

Oops 11012296146944 [1]
Modules linked in: binfmt_misc dm_mirror dm_multipath dm_mod thermal processor f
an container button sg eepro100 e100 mii

Pid: 0, CPU 0, comm:              swapper
psr : 0000121008022038 ifs : 800000000000040b ip  : [<a0000001000e1411>]    Not
tainted
ip is at __do_IRQ+0x371/0x3e0
unat: 0000000000000000 pfs : 000000000000040b rsc : 0000000000000003
rnat: 656960155aa56aa5 bsps: a00000010058b890 pr  : 656960155aa55a65
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001000e1390 b6  : a0000001005beac0 b7  : e00000007f01aa00
f6  : 000000000000000000000 f7  : 0ffe69090000000000000
f8  : 1000a9090000000000000 f9  : 0ffff8000000000000000
f10 : 1000a908ffffff6f70000 f11 : 1003e0000000000000909
r1  : a000000100fbbff0 r2  : 0000000000010002 r3  : 0000000000010001
r8  : fffffffffffbffff r9  : a000000100bd8060 r10 : a000000100dd83b8
r11 : fffffffffffeffff r12 : a000000100bcbbb0 r13 : a000000100bc4000
r14 : 0000000000010000 r15 : 0000000000010000 r16 : a000000100c01aa8
r17 : a000000100d2c350 r18 : 0000000000000000 r19 : a000000100d2c300
r20 : a000000100c01a88 r21 : 0000000080010100 r22 : a000000100c01ac0
r23 : a0000001000108e0 r24 : e000000477980004 r25 : 0000000000000000
r26 : 0000000000000000 r27 : e00000000913400c r28 : e0000004799ee51c
r29 : e0000004778b87f0 r30 : a000000100d2c300 r31 : a00000010005c7e0

Call Trace:
 [<a000000100014600>] show_stack+0x40/0xa0
                                sp=a000000100bcb760 bsp=a000000100bc4f40
 [<a000000100014f00>] show_regs+0x840/0x880
                                sp=a000000100bcb930 bsp=a000000100bc4ee8
 [<a000000100037fb0>] die+0x250/0x320
                                sp=a000000100bcb930 bsp=a000000100bc4ea0
 [<a00000010005e5f0>] ia64_do_page_fault+0x8d0/0xa20
                                sp=a000000100bcb950 bsp=a000000100bc4e50
 [<a00000010000caa0>] ia64_leave_kernel+0x0/0x290
                                sp=a000000100bcb9e0 bsp=a000000100bc4e50
 [<a0000001000e1410>] __do_IRQ+0x370/0x3e0
                                sp=a000000100bcbbb0 bsp=a000000100bc4df0
 [<a000000100011f50>] ia64_handle_irq+0x170/0x220
                                sp=a000000100bcbbb0 bsp=a000000100bc4dc0
 [<a00000010000caa0>] ia64_leave_kernel+0x0/0x290
                                sp=a000000100bcbbb0 bsp=a000000100bc4dc0
 [<a000000100012390>] ia64_pal_call_static+0x90/0xc0
                                sp=a000000100bcbd80 bsp=a000000100bc4d78
 [<a000000100015630>] default_idle+0x90/0x160
                                sp=a000000100bcbd80 bsp=a000000100bc4d58
 [<a000000100014290>] cpu_idle+0x1f0/0x440
                                sp=a000000100bcbe20 bsp=a000000100bc4d18
 [<a000000100009980>] rest_init+0xc0/0xe0
                                sp=a000000100bcbe20 bsp=a000000100bc4d00
 [<a0000001009f8ea0>] start_kernel+0x6a0/0x6c0
                                sp=a000000100bcbe20 bsp=a000000100bc4ca0
 [<a0000001000089f0>] __end_ivt_text+0x6d0/0x6f0
                                sp=a000000100bcbe30 bsp=a000000100bc4c00
 <0>Kernel panic - not syncing: Aiee, killing interrupt handler!

The root cause is that some irq_chip variables, especially ia64_msi_chip,
initiate their memeber end to point to NULL. __do_IRQ doesn't check
if irq_chip->end is null and just calls it after processing the interrupt.

As irq_chip->end is called at many places, so I fix it by reinitiating
irq_chip->end to dummy_irq_chip.end, e.g., a noop function.

Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-16 11:43:37 -08:00
Linus Torvalds
9a3a04ac38 Revert "[PATCH] fix Data Acess error in dup_fd"
This reverts commit 0130b0b32e.

Sergey Vlasov points out (and Vadim Lobanov concurs) that the bug it was
supposed to fix must be some unrelated memory corruption, and the "fix"
actually causes more problems:

  "However, the new code does not look safe in all cases.  If some other
   task has opened more files while dup_fd() released oldf->file_lock, the
   new code will update open_files to the new larger value.  But newf was
   allocated with the old smaller value of open_files, therefore subsequent
   accesses to newf may try to write into unallocated memory."

so revert it.

Cc: Sharyathi Nagesh <sharyath@in.ibm.com>
Cc: Sergey Vlasov <vsu@altlinux.ru>
Cc: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-14 15:20:51 -08:00
Andrew Morton
8b126b7753 [PATCH] setup_irq(): better mismatch debugging
When we get a mismatch between handlers on the same IRQ, all we get is "IRQ
handler type mismatch for IRQ n".  Let's print the name of the
presently-registered handler with which we got the mismatch.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-14 09:09:26 -08:00
Pavel Emelianov
f72fa70760 [PATCH] Fix misrouted interrupts deadlocks
While testing kernel on machine with "irqpoll" option I've caught such a
lockup:

	__do_IRQ()
	   spin_lock(&desc->lock);
           desc->chip->ack(); /* IRQ is ACKed */
	note_interrupt()
	misrouted_irq()
	handle_IRQ_event()
           if (...)
	      local_irq_enable_in_hardirq();
	/* interrupts are enabled from now */
	...
	__do_IRQ() /* same IRQ we've started from */
	   spin_lock(&desc->lock); /* LOCKUP */

Looking at misrouted_irq() code I've found that a potential deadlock like
this can also take place:

1CPU:
__do_IRQ()
   spin_lock(&desc->lock); /* irq = A */
misrouted_irq()
   for (i = 1; i < NR_IRQS; i++) {
      spin_lock(&desc->lock); /* irq = B */
      if (desc->status & IRQ_INPROGRESS) {

2CPU:
__do_IRQ()
   spin_lock(&desc->lock); /* irq = B */
misrouted_irq()
   for (i = 1; i < NR_IRQS; i++) {
      spin_lock(&desc->lock); /* irq = A */
      if (desc->status & IRQ_INPROGRESS) {

As the second lock on both CPUs is taken before checking that this irq is
being handled in another processor this may cause a deadlock.  This issue
is only theoretical.

I propose the attached patch to fix booth problems: when trying to handle
misrouted IRQ active desc->lock may be unlocked.

Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-13 07:40:43 -08:00
Sharyathi Nagesh
0130b0b32e [PATCH] fix Data Acess error in dup_fd
On running the Stress Test on machine for more than 72 hours following
error message was observed.

0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000007ce2f7f0]
    pc: c000000000060d90: .dup_fd+0x240/0x39c
    lr: c000000000060d6c: .dup_fd+0x21c/0x39c
    sp: c00000007ce2fa70
   msr: 800000000000b032
   dar: ffffffff00000028
 dsisr: 40000000
  current = 0xc000000074950980
  paca    = 0xc000000000454500
    pid   = 27330, comm = bash

0:mon> t
[c00000007ce2fa70] c000000000060d28 .dup_fd+0x1d8/0x39c (unreliable)
[c00000007ce2fb30] c000000000060f48 .copy_files+0x5c/0x88
[c00000007ce2fbd0] c000000000061f5c .copy_process+0x574/0x1520
[c00000007ce2fcd0] c000000000062f88 .do_fork+0x80/0x1c4
[c00000007ce2fdc0] c000000000011790 .sys_clone+0x5c/0x74
[c00000007ce2fe30] c000000000008950 .ppc_clone+0x8/0xc

The problem is because of race window.  When if(expand) block is executed in
dup_fd unlocking of oldf->file_lock give a window for fdtable in oldf to be
modified.  So actual open_files in oldf may not match with open_files
variable.

Cc: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-13 07:40:43 -08:00
Eric W. Biederman
d99f160ac5 [PATCH] sysctl: allow a zero ctl_name in the middle of a sysctl table
Since it is becoming clear that there are just enough users of the binary
sysctl interface that completely removing the binary interface from the kernel
will not be an option for foreseeable future, we need to find a way to address
the sysctl maintenance issues.

The basic problem is that sysctl requires one central authority to allocate
sysctl numbers, or else conflicts and ABI breakage occur.  The proc interface
to sysctl does not have that problem, as names are not densely allocated.

By not terminating a sysctl table until I have neither a ctl_name nor a
procname, it becomes simple to add sysctl entries that don't show up in the
binary sysctl interface.  Which allows people to avoid allocating a binary
sysctl value when not needed.

I have audited the kernel code and in my reading I have not found a single
sysctl table that wasn't terminated by a completely zero filled entry.  So
this change in behavior should not affect anything.

I think this mechanism eases the pain enough that combined with a little
disciple we can solve the reoccurring sysctl ABI breakage.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-06 01:46:23 -08:00
Eric W. Biederman
0e009be8a0 [PATCH] Improve the removed sysctl warnings
Don't warn about libpthread's access to kernel.version.  When it receives
-ENOSYS it will read /proc/sys/kernel/version.

If anything else shows up print the sysctl number string.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cal Peake <cp@absolutedigital.net>
Cc: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-06 01:46:23 -08:00
Peter Zijlstra
64efade11c [PATCH] lockdep: fix delayacct locking bug
Make the delayacct lock irqsave; this avoids the possible deadlock where
an interrupt is taken while holding the delayacct lock which needs to
take the delayacct lock.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-06 01:46:23 -08:00
Gautham R Shenoy
4b96b1a10c [PATCH] Fix the spurious unlock_cpu_hotplug false warnings
Cpu-hotplug locking has a minor race case caused because of setting the
variable "recursive" to NULL *after* releasing the cpu_bitmask_lock in the
function unlock_cpu_hotplug,instead of doing so before releasing the
cpu_bitmask_lock.

This was the cause of most of the recent false spurious lock_cpu_unlock
warnings.

This should fix the problem reported by Martin Lorenz reported in
http://lkml.org/lkml/2006/10/29/127.

Thanks to Srinivasa DS for pointing it out.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-06 01:46:22 -08:00
Linus Torvalds
10b1fbdb0a Make sure "user->sigpending" count is in sync
The previous commit (45c18b0bb5, aka "Fix
unlikely (but possible) race condition on task->user access") fixed a
potential oops due to __sigqueue_alloc() getting its "user" pointer out
of sync with switch_user(), and accessing a user pointer that had been
de-allocated on another CPU.

It still left another (much less serious) problem, where a concurrent
__sigqueue_alloc and swich_user could cause sigqueue_alloc to do signal
pending reference counting for a _different_ user than the one it then
actually ended up using.  No oops, but we'd end up with the wrong signal
accounting.

Another case of Oleg's eagle-eyes picking up the problem.

This is trivially fixed by just making sure we load whichever "user"
structure we decide to use (it doesn't matter _which_ one we pick, we
just need to pick one) just once.

Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-04 13:03:00 -08:00
Linus Torvalds
45c18b0bb5 Fix unlikely (but possible) race condition on task->user access
There's a possible race condition when doing a "switch_uid()" from one
user to another, which could race with another thread doing a signal
allocation and looking at the old thread ->user pointer as it is freed.

This explains an oops reported by Lukasz Trabinski:
	http://permalink.gmane.org/gmane.linux.kernel/462241

We fix this by delaying the (reference-counted) freeing of the user
structure until the thread signal handler lock has been released, so
that we know that the signal allocation has either seen the new value or
has properly incremented the reference count of the old one.

Race identified by Oleg Nesterov.

Cc: Lukasz Trabinski <lukasz@wsisiz.edu.pl>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-04 10:06:02 -08:00
Stephen Rothwell
3fd5939798 [PATCH] Create compat_sys_migrate_pages
This is needed on bigendian 64bit architectures.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:59 -08:00
Rafael J. Wysocki
b918f6e62c [PATCH] swsusp: debugging
Add a swsusp debugging mode.  This does everything that's needed for a suspend
except for actually suspending.  So we can look in the log messages and work
out a) what code is being slow and b) which drivers are misbehaving.

(1)
# echo testproc > /sys/power/disk
# echo disk > /sys/power/state

This should turn off the non-boot CPU, freeze all processes, wait for 5
seconds and then thaw the processes and the CPU.

(2)
# echo test > /sys/power/disk
# echo disk > /sys/power/state

This should turn off the non-boot CPU, freeze all processes, shrink
memory, suspend all devices, wait for 5 seconds, resume the devices etc.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: Stefan Seyfried <seife@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:58 -08:00
Andrew Morton
19c6b6ed3f [PATCH] schedule removal of FUTEX_FD
Apparently FUTEX_FD is unfixably racy and nothing uses it (or if it does, it
shouldn't).

Add a warning printk, give any remaining users six months to migrate off it.

Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:58 -08:00
Andrew Morton
f46c483357 [PATCH] Add printk_timed_ratelimit()
printk_ratelimit() has global state which makes it not useful for callers
which wish to perform ratelimiting at a particular frequency.

Add a printk_timed_ratelimit() which utilises caller-provided state storage to
permit more flexibility.

This function can in fact be used for things other than printk ratelimiting
and is perhaps poorly named.

Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:58 -08:00
Rafael J. Wysocki
9185cfa925 ACPI: S4: Use "platform" rather than "shutdown" mode by default
It has been reported that on some systems the functionality after a resume
from disk is limited if the system is simply powered off during the suspend
instead of using the ACPI S4 suspend (aka platform mode).

Unfortunately the default is currently to power off the system during the
suspend so the users of these systems experience problems after the resume
if they don't switch to the platform mode explicitly.  This patch makes swsusp
use the platform mode by default to avoid such situations.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Stefan Seyfried <seife@suse.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2006-11-01 22:40:23 -05:00
Oleg Nesterov
4a279ff1ea [PATCH] taskstats: fix sub-threads accounting
If there are no listeners, taskstats_exit_send() just returns because
taskstats_exit_alloc() didn't allocate *tidstats.  This is wrong, each
sub-thread should do fill_tgid_exit() on exit, otherwise its ->delays is
not recorded in ->signal->stats and lost.

Q: We don't send TASKSTATS_TYPE_AGGR_TGID when single-threaded process
exits.  Is it good?  How can the listener figure out that it was actually a
process exit, not sub-thread?

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-31 08:07:00 -08:00
Oleg Nesterov
f0ec1aaf54 [PATCH] xacct_add_tsk: fix pure theoretical ->mm use-after-free
Paranoid fix. The task can free its ->mm after the 'if (p->mm)' check.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-30 12:08:41 -08:00
Randy Dunlap
0e2d57fc6e [PATCH] ndiswrapper: don't set the module->taints flags
For ndiswrapper, don't set the module->taints flags, just set the kernel
global tainted flag.  This should allow ndiswrapper to continue to use GPL
symbols.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Florin Malita <fmalita@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-30 12:08:40 -08:00
Oleg Nesterov
3d8334def5 [PATCH] taskstats: fix sk_buff size calculation
prepare_reply() adds GENL_HDRLEN to the payload (genlmsg_total_size()),
but then it does genlmsg_put()->nlmsg_put().  This means we forget to
reserve a room for 'struct nlmsghdr'.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-29 12:07:37 -08:00
Oleg Nesterov
d46a3d0d07 [PATCH] taskstats: fix sk_buff leak
'return genlmsg_cancel()' in taskstats_user_cmd/taskstats_exit_send
potentially leaks a skb.  Unless we pass 'rep_skb' to the netlink layer
we own sk_buff.  This means we should always do kfree_skb() on failure.

[ Thomas acked and pointed out missing return value in original version ]

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Thomas Graf <tgraf@suug.ch>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-29 12:07:37 -08:00
Alan Stern
057647fc47 [PATCH] workqueue: update kerneldoc
This patch (as812) changes the kerneldoc comments explaining the return
values from queue_work(), queue_delayed_work(), and
queue_delayed_work_on().  The updated comments explain more accurately the
meaning of the return code and avoid suggesting that a 0 value means the
routine was unsuccessful.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:55 -07:00
Satoru Takeuchi
8fa1d7d3b2 [PATCH] cpu-hotplug: release `workqueue_mutex' properly on CPU hot-remove
_cpu_down() acquires `workqueue_mutex' on its process, but doen't release it
if __cpu_disable() fails.

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:55 -07:00
Jim Houston
bb1d860551 [PATCH] time_adjust cleared before use
I notice that the code which implements adjtime clears the time_adjust
value before using it.  The attached patch makes the obvious fix.

Acked-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Jim Houston <jim.houston@ccur.com>
Cc: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:55 -07:00
Oleg Nesterov
d7c3f5f231 [PATCH] fill_tgid: cleanup delays accounting
fill_tgid() should skip not only an already exited group leader.  If the
task has ->exit_state != 0 it already did exit_notify(), so it also did
fill_tgid_exit()->delayacct_add_tsk(->signal->stats) and we should skip it
to avoid a double accounting.

This patch doesn't close the race completely, but it cleanups the code.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:55 -07:00
Oleg Nesterov
a98b609426 [PATCH] taskstats: don't use tasklist_lock
Remove tasklist_lock from taskstats.c. find_task_by_pid() is rcu-safe.
->siglock allows us to traverse subthread without tasklist.

Q: delay accounting looks wrong to me.  If sub-thread has already called
taskstats_exit_send() but didn't call release_task(self) yet it will be
accounted twice.  The window is big.  No?

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:54 -07:00
Oleg Nesterov
b8534d7bd8 [PATCH] taskstats: kill ->taskstats_lock in favor of ->siglock
signal_struct is (mostly) protected by ->sighand->siglock, I think we don't
need ->taskstats_lock to protect ->stats.  This also allows us to simplify the
locking in fill_tgid().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:54 -07:00
Oleg Nesterov
093a8e8aec [PATCH] taskstats_tgid_free: fix usage
taskstats_tgid_free() is called on copy_process's error path. This is wrong.

	IF (clone_flags & CLONE_THREAD)
		We should not clear ->signal->taskstats, current uses it,
		it probably has a valid accumulated info.
	ELSE
		taskstats_tgid_init() set ->signal->taskstats = NULL,
		there is nothing to free.

Move the callsite to __exit_signal(). We don't need any locking, entire
thread group is exiting, nobody should have a reference to soon to be
released ->signal.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:54 -07:00
Oleg Nesterov
05d5bcd60e [PATCH] bacct_add_tsk: fix unsafe and wrong parent/group_leader dereference
1. ts = timespec_sub(uptime, current->group_leader->start_time);

   It is possible that current != tsk. Probably it was supposed
   to be 'tsk->group_leader->start_time. But why we are reading
   group_leader's start_time ? This accounting is per thread,
   not per procees, I changed this to 'tsk->start_time.
   Please corect me.

2. stats->ac_ppid = (tsk->parent) ? tsk->parent->pid : 0;

   tsk->parent never == NULL, and it is unsafe to dereference it.
   Both the task and it's parent may exit after the caller unlocks
   tasklist_lock, the memory could be unmapped (DEBUG_SLAB).
   (And we should use ->real_parent->tgid in fact).

Q: I don't understand the 'if (thread_group_leader(tsk))' check.
Why it is needed ?

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:54 -07:00
Oleg Nesterov
fca178c0c6 [PATCH] fill_tgid: fix task_struct leak and possible oops
1. fill_tgid() forgets to do put_task_struct(first).

2. release_task(first) can happen after fill_tgid() drops tasklist_lock,
   it is unsafe to dereference first->signal.

This is a temporary fix, imho the locking should be reworked.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:54 -07:00
Stephen Rothwell
5fa3839a64 [PATCH] Constify compat_get_bitmap argument
This means we can call it when the bitmap we want to fetch is declared
const.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:54 -07:00
Jan Dittmer
1d4d262769 [PATCH] Add missing space in module.c for taintskernel
Obvious fix.

Signed-off-by: Jan Dittmer <jdi@l4x.org>
Acked-by: Florin Malita <fmalita@gmail.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:53 -07:00
Jan Beulich
690a973f48 [PATCH] x86-64: Speed up dwarf2 unwinder
This changes the dwarf2 unwinder to do a binary search for CIEs
instead of a linear work. The linker is unfortunately not
able to build a proper lookup table at link time, instead it creates
one at runtime as soon as the bootmem allocator is usable (so you'll continue
using the linear lookup for the first [hopefully] few calls).
The code should be ready to utilize a build-time created table once
a fixed linker becomes available.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-10-21 18:37:01 +02:00
Alexey Dobriyan
e05d722e45 [PATCH] kernel/nsproxy.c: use kmemdup()
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:44 -07:00
Randy Dunlap
d6f8ff7381 [PATCH] cad_pid sysctl with PROC_FS=n
If CONFIG_PROC_FS=n:

kernel/sysctl.c:148: warning: 'proc_do_cad_pid' used but never defined
kernel/built-in.o:(.data+0x1228): undefined reference to `proc_do_cad_pid'
make: *** [.tmp_vmlinux1] Error 1

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:38 -07:00
Borislav Petkov
91fcdd4e03 [PATCH] readjust comments of task_timeslice for kernel doc
Signed-off-by: Borislav Petkov <petkov@math.uni-muenster.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:37 -07:00
Linus Torvalds
43f82216f0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
* git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: fm801-gp - handle errors from pci_enable_device()
  Input: gameport core - handle errors returned by device_bind_driver()
  Input: serio core - handle errors returned by device_bind_driver()
  Lockdep: fix compile error in drivers/input/serio/serio.c
  Input: serio - add lockdep annotations
  Lockdep: add lockdep_set_class_and_subclass() and lockdep_set_subclass()
  Input: atkbd - supress "too many keys" error message
  Input: i8042 - supress ACK/NAKs when blinking during panic
  Input: add missing exports to fix modular build
2006-10-17 08:56:43 -07:00
Neil Brown
bd5349cfd2 [PATCH] Convert cpu hotplug notifiers to use raw_notifier instead of blocking_notifier
The use of blocking notifier by _cpu_up and _cpu_down in cpu.c has two
problem.

1/ An interaction with the workqueue notifier causes lockdep to spit a
   warning.

2/ A notifier could conceivable be added or removed while _cpu_up or
   _cpu_down are in process.  As each notifier is called twice (prepare
   then commit/abort) this could be unhealthy.

To fix to we simply take cpu_add_remove_lock while adding or removing
notifiers to/from the list.

This makes the 'blocking' usage unnecessary as all accesses to cpu_chain
are now protected by cpu_add_remove_lock.  So change "blocking" to "raw" in
all relevant places.  This fixes 1.

Credit: Andrew Morton
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com> (reporter)
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:48 -07:00
Peter Zijlstra
bea493a031 [PATCH] rt-mutex: fixup rt-mutex debug code
BUG: warning at kernel/rtmutex-debug.c:125/rt_mutex_debug_task_free() (Not tainted)
 [<c04051e3>] show_trace_log_lvl+0x58/0x16a
 [<c04057f0>] show_trace+0xd/0x10
 [<c0405900>] dump_stack+0x19/0x1b
 [<c043f03d>] rt_mutex_debug_task_free+0x35/0x6a
 [<c04224c0>] free_task+0x15/0x24
 [<c042378c>] copy_process+0x12bd/0x1324
 [<c0423835>] do_fork+0x42/0x113
 [<c04021dd>] sys_fork+0x19/0x1b
 [<c0403fb7>] syscall_call+0x7/0xb

In copy_process(), dup_task_struct() also duplicates the ->pi_lock,
->pi_waiters and ->pi_blocked_on members.  rt_mutex_debug_task_free()
called from free_task() validates these members.  However free_task() can
be invoked before these members are reset for the new task.

Move the initialization code before the first bail that can hit free_task().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:48 -07:00
Ingo Molnar
a460e745e8 [PATCH] genirq: clean up irq-flow-type naming
Introduce desc->name and eliminate the handle_irq_name() hack.  Add
set_irq_chip_and_handler_name() to set the flow type and name at once.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:45 -07:00
Andrew Morton
c60099bfe3 [PATCH] swsusp: fix memory leaks
My fancy new swsusp IO code had a big memory leak.  It's somewhat invisible
because the whole mem_map[] gets overwritten after resume, but it can cause us
to get low on memory during the actual suspend process.

Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:44 -07:00
Thomas Gleixner
ac08c26492 [PATCH] posix-cpu-timers: prevent signal delivery starvation
The integer divisions in the timer accounting code can round the result
down to 0.  Adding 0 is without effect and the signal delivery stops.

Clamp the division result to minimum 1 to avoid this.

Problem was reported by Seongbae Park <spark@google.com>, who provided
also an inital patch.

Roland sayeth:

  I have had some more time to think about the problem, and to reproduce it
  using Toyo's test case.  For the record, if my understanding of the problem
  is correct, this happens only in one very particular case.  First, the
  expiry time has to be so soon that in cputime_t units (usually 1s/HZ ticks)
  it's < nthreads so the division yields zero.  Second, it only affects each
  thread that is so new that its CPU time accumulation is zero so now+0 is
  still zero and ->it_*_expires winds up staying zero.  For the VIRT and PROF
  clocks when cputime_t is tick granularity (or the SCHED clock on
  configurations where sched_clock's value only advances on clock ticks), this
  is not hard to arrange with new threads starting up and blocking before they
  accumulate a whole tick of CPU time.  That's what happens in Toyo's test
  case.

  Note that in general it is fine for that division to round down to zero,
  and set each thread's expiry time to its "now" time.  The problem only
  arises with thread's whose "now" value is still zero, so that now+0 winds up
  0 and is interpreted as "not set" instead of ">= now".  So it would be a
  sufficient and more precise fix to just use max(ticks, 1) inside the loop
  when setting each it_*_expires value.

  But, it does no harm to round the division up to one and always advance
  every thread's expiry time.  If the thread didn't already fire timers for
  the expiry time of "now", there is no expectation that it will do so before
  the next tick anyway.  So I followed Thomas's patch in lifting the max out
  of the loops.

  This patch also covers the reload cases, which are harder to write a test
  for (and I didn't try).  I've tested it with Toyo's case and it fixes that.

[toyoa@mvista.com: fix: min_t -> max_t]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Daniel Walker <dwalker@mvista.com>
Cc: Toyo Abe <toyoa@mvista.com>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Seongbae Park <spark@google.com>
Cc: Peter Mattis <pmattis@google.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:43 -07:00
john stultz
3f4a0b917c [PATCH] i386 Time: Avoid PIT SMP lockups
Avoid possible PIT livelock issues seen on SMP systems (and reported by
Andi), by not allowing it as a clocksource on SMP boxes.

However, since the PIT may no longer be present, we have to properly handle
the cases where SMP systems have TSC skew and fall back from the TSC.
Since the PIT isn't there, it would "fall back" to the TSC again.  So this
changes the jiffies rating to 1, and the TSC-bad rating value to 0.

Thus you will get the following behavior priority on i386 systems:

tsc		[if present & stable]
hpet		[if present]
cyclone		[if present]
acpi_pm		[if present]
pit		[if UP]
jiffies

Rather then the current more complicated:
tsc		[if present & stable]
hpet		[if present]
cyclone		[if present]
acpi_pm		[if present]
pit		[if cpus < 4]
tsc		[if present & unstable]
jiffies

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:42 -07:00
Ingo Molnar
ca268c691d [PATCH] lockdep: increase max allowed recursion depth
In general, lockdep warnings are intended to be non-fatal, so I have put in
various practical limits on internal data structure failure modes.  We haven't
had a /single/ lockdep-internal crash ever since lockdep went upstream [the
unwinder crashes are outside of lockdep], and that's largely due to the good
internal checks it does.

Recursion within the dependency graph is currently limited to 20, that's
probably not enough on some many-CPU boxes - this patch doubles it to 40.  I
have written the lockdep functions to have as small stackframes as possible,
so 40 should be OK too.  (The practical recursion limit should be somewhere
between 100 and 200 entries.  If we hit that then I'll change the algorithm to
be iteration-based.  Graph walking logic is so easy to program via recursion,
so i'd like to keep recursion as long as possible.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:42 -07:00
Randy Dunlap
39af114377 [PATCH] fix epoll_pwait when EPOLL=n
Fixes http://bugzilla.kernel.org/show_bug.cgi?id=7371

sys_epoll_pwait needs to be listed as a conditional (weak)
entry point for CONFIG_EPOLL=n.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-16 09:14:05 -07:00
Ingo Molnar
256a6b4136 [PATCH] lockdep: fix printk recursion logic
Bug reported and fixed by Tilman Schmidt <tilman@imap.cc>: if lockdep is
enabled then log messages make it to /var/log/messages belatedly.  The
reason is a missed wakeup of klogd.

Initially there was only a lockdep_internal() protection against lockdep
recursion within vprintk() - it grew the 'outer' lockdep_off()/on()
protection only later on.  But that lockdep_off() made the
release_console_sem() within vprintk() always happen under the
lockdep_internal() condition, causing the bug.

The right solution to remove the inner protection against recursion here -
the outer one is enough.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Tilman Schmidt <tilman@imap.cc>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:24 -07:00
Alexey Dobriyan
3dc3099a9b [PATCH] lockdep: use BUILD_BUG_ON
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:24 -07:00
Reinette Chatre
01a3ee2b20 [PATCH] bitmap: parse input from kernel and user buffers
lib/bitmap.c:bitmap_parse() is a library function that received as input a
user buffer.  This seemed to have originated from the way the write_proc
function of the /proc filesystem operates.

This has been reworked to not use kmalloc and eliminates a lot of
get_user() overhead by performing one access_ok before using __get_user().

We need to test if we are in kernel or user space (is_user) and access the
buffer differently.  We cannot use __get_user() to access kernel addresses
in all cases, for example in architectures with separate address space for
kernel and user.

This function will be useful for other uses as well; for example, taking
input for /sysfs instead of /proc, so it was changed to accept kernel
buffers.  We have this use for the Linux UWB project, as part as the
upcoming bandwidth allocator code.

Only a few routines used this function and they were changed too.

Signed-off-by: Reinette Chatre <reinette.chatre@linux.intel.com>
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Joe Korty <joe.korty@ccur.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:22 -07:00
Nick Piggin
beed33a816 [PATCH] sched: likely profiling
This likely profiling is pretty fun. I found a few possible problems
in sched.c.

This patch may be not measurable, but when I did measure long ago,
nooping (un)likely cost a couple of % on scheduler heavy benchmarks, so
it all adds up.

Tweak some branch hints:

- the 2nd 64 bits in the bitmask is likely to be populated, because it
  contains the first 28 bits (nearly 3/4) of the normal priorities.
  (ratio of 669669:691 ~= 1000:1).

- it isn't unlikely that context switching switches to another process. it
  might be very rapidly switching to and from the idle process (ratio of
  475815:419004 and 471330:423544). Let the branch predictor decide.

- preempt_enable seems to be very often called in a nested preempt_disable
  or with interrupts disabled (ratio of 3567760:87965 ~= 40:1)

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Daniel Walker <dwalker@mvista.com>
Cc: Hua Zhong <hzhong@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:22 -07:00
Florin Malita
fa3ba2e81e [PATCH] fix Module taint flags listing in Oops/panic
Module taint flags listing in Oops/panic has a couple of issues:

* taint_flags() doesn't null-terminate the buffer after printing the flags

* per-module taints are only set if the kernel is not already tainted
  (with that particular flag) => only the first offending module gets its
  taint info correctly updated

Some additional changes:

* 'license_gplok' is no longer needed - equivalent to !(taints &
  TAINT_PROPRIETARY_MODULE) - so we can drop it from struct module *
  exporting module taint info via /proc/module:

pwc 88576 0 - Live 0xf8c32000
evilmod 6784 1 pwc, Live 0xf8bbf000 (PF)

Signed-off-by: Florin Malita <fmalita@gmail.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:21 -07:00
Christoph Lameter
469340236a [PATCH] mm: kevent threads: use MPOL_DEFAULT
Switch the memory policy of the kevent threads to MPOL_DEFAULT while
leaving the kzalloc of the workqueue structure on interleave.  This means
that all code executed in the context of the kevent thread is allocating
node local.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Alok Kataria <alok.kataria@calsoftinc.com>
Cc: Andi Kleen <ak@suse.de>
Cc: <pj@sgi.com>
Cc: <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:19 -07:00
Rafael J. Wysocki
97c7801cd5 [PATCH] swsusp: Use suspend_console
Add suspend_console() and resume_console() to the suspend-to-disk code paths
so that the users of netconsole can use swsusp with it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:14 -07:00
Peter Zijlstra
4dfbb9d8c6 Lockdep: add lockdep_set_class_and_subclass() and lockdep_set_subclass()
This annotation makes it possible to assign a subclass on lock init. This
annotation is meant to reduce the _nested() annotations by assigning a
default subclass.

One could do without this annotation and rely on lockdep_set_class()
exclusively, but that would require a manual stack of struct lock_class_key
objects.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
2006-10-11 01:45:14 -04:00
Al Viro
1af9892811 [PATCH] cpuset ANSI prototype
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-10 15:37:23 -07:00
Al Viro
ba2397efe1 [PATCH] make kernel/relay.c __user-clean
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-10 15:37:22 -07:00
Al Viro
ba46df984b [PATCH] __user annotations: futex
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-10 15:37:22 -07:00
Rafael J. Wysocki
5c339d4541 [PATCH] swsusp: Make userland suspend work on SMP again
Unfortunately one of the recent changes in swsusp has broken the userland
suspend on SMP.  Fix it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-07 10:51:14 -07:00
Frederik Deweerdt
e317c8ccaa [PATCH] ixp4xxdefconfig arm fixes
With the following patch, the ixp4xxdefconfig builds correctly.  I'll
test some more configs if I get some time.

Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-06 12:11:08 -07:00
Andrew Morton
4899b8b16b [PATCH] kauditd_thread warning fix
Squash this warning:

  kernel/audit.c: In function 'kauditd_thread':
  kernel/audit.c:367: warning: no return statement in function returning non-void

We might as test kthread_should_stop(), although it's not very pointful at
present.

The code which starts this thread looks racy - the kernel could start multiple
threads.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-06 08:53:39 -07:00
David Howells
7d12e780e0 IRQ: Maintain regs pointer globally rather than passing to IRQ handlers
Maintain a per-CPU global "struct pt_regs *" variable which can be used instead
of passing regs around manually through all ~1800 interrupt handlers in the
Linux kernel.

The regs pointer is used in few places, but it potentially costs both stack
space and code to pass it around.  On the FRV arch, removing the regs parameter
from all the genirq function results in a 20% speed up of the IRQ exit path
(ie: from leaving timer_interrupt() to leaving do_IRQ()).

Where appropriate, an arch may override the generic storage facility and do
something different with the variable.  On FRV, for instance, the address is
maintained in GR28 at all times inside the kernel as part of general exception
handling.

Having looked over the code, it appears that the parameter may be handed down
through up to twenty or so layers of functions.  Consider a USB character
device attached to a USB hub, attached to a USB controller that posts its
interrupts through a cascaded auxiliary interrupt controller.  A character
device driver may want to pass regs to the sysrq handler through the input
layer which adds another few layers of parameter passing.

I've build this code with allyesconfig for x86_64 and i386.  I've runtested the
main part of the code on FRV and i386, though I can't test most of the drivers.
I've also done partial conversion for powerpc and MIPS - these at least compile
with minimal configurations.

This will affect all archs.  Mostly the changes should be relatively easy.
Take do_IRQ(), store the regs pointer at the beginning, saving the old one:

	struct pt_regs *old_regs = set_irq_regs(regs);

And put the old one back at the end:

	set_irq_regs(old_regs);

Don't pass regs through to generic_handle_irq() or __do_IRQ().

In timer_interrupt(), this sort of change will be necessary:

	-	update_process_times(user_mode(regs));
	-	profile_tick(CPU_PROFILING, regs);
	+	update_process_times(user_mode(get_irq_regs()));
	+	profile_tick(CPU_PROFILING);

I'd like to move update_process_times()'s use of get_irq_regs() into itself,
except that i386, alone of the archs, uses something other than user_mode().

Some notes on the interrupt handling in the drivers:

 (*) input_dev() is now gone entirely.  The regs pointer is no longer stored in
     the input_dev struct.

 (*) finish_unlinks() in drivers/usb/host/ohci-q.c needs checking.  It does
     something different depending on whether it's been supplied with a regs
     pointer or not.

 (*) Various IRQ handler function pointers have been moved to type
     irq_handler_t.

Signed-Off-By: David Howells <dhowells@redhat.com>
(cherry picked from 1b16e7ac850969f38b375e511e3fa2f474a33867 commit)
2006-10-05 15:10:12 +01:00
David Howells
da482792a6 IRQ: Typedef the IRQ handler function type
Typedef the IRQ handler function type.

Signed-Off-By: David Howells <dhowells@redhat.com>
(cherry picked from 1356d1e5fd256997e3d3dce0777ab787d0515c7a commit)
2006-10-05 13:28:27 +01:00
David Howells
57a58a9435 IRQ: Typedef the IRQ flow handler function type
Typedef the IRQ flow handler function type.

Signed-Off-By: David Howells <dhowells@redhat.com>
(cherry picked from 8e973fbdf5716b93a0a8c0365be33a31ca0fa351 commit)
2006-10-05 13:28:06 +01:00
Linus Torvalds
fefd26b3b8 Merge master.kernel.org:/pub/scm/linux/kernel/git/davej/configh
* master.kernel.org:/pub/scm/linux/kernel/git/davej/configh:
  Remove all inclusions of <linux/config.h>

Manually resolved trivial path conflicts due to removed files in
the sound/oss/ subdirectory.
2006-10-04 09:59:57 -07:00
Linus Torvalds
18e6756a6b Merge branch 'audit.b32' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b32' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] message types updated
  [PATCH] name_count array overrun
  [PATCH] PPID filtering fix
  [PATCH] arch filter lists with < or > should not be accepted
2006-10-04 08:15:55 -07:00
Oleg Nesterov
20e9751bd9 [PATCH] rcu: simplify/improve batch tuning
Kill a hard-to-calculate 'rsinterval' boot parameter and per-cpu
rcu_data.last_rs_qlen.  Instead, it adds adds a flag rcu_ctrlblk.signaled,
which records the fact that one of CPUs has sent a resched IPI since the
last rcu_start_batch().

Roughly speaking, we need two rcu_start_batch()s in order to move callbacks
from ->nxtlist to ->donelist.  This means that when ->qlen exceeds qhimark
and continues to grow, we should send a resched IPI, and then do it again
after we gone through a quiescent state.

On the other hand, if it was already sent, we don't need to do it again
when another CPU detects overflow of the queue.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
4b6c2cca6e [PATCH] rcu: add sched torture type to rcutorture
Implement torture testing for the "sched" variant of RCU, which uses
preempt_disable, preempt_enable, and synchronize_sched.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
11a147013e [PATCH] rcu: add rcu_bh_sync torture type to rcutorture
Use the newly-generic synchronous deferred free function to implement torture
testing for rcu_bh using synchronize_rcu_bh rather than the asynchronous
call_rcu_bh.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
20d2e4283a [PATCH] rcu: add rcu_sync torture type to rcutorture
Use the newly-generic synchronous deferred free function to implement torture
testing for RCU using synchronize_rcu rather than the asynchronous call_rcu.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
e303373658 [PATCH] rcu: refactor srcu_torture_deferred_free to work for any implementation
Make srcu_torture_deferred_free use cur_ops->sync() so it will work for any
implementation.  Move and rename it in preparation for use in the ops of other
implementations.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
b772e1dd4b [PATCH] RCU: add fake writers to rcutorture
rcutorture currently has one writer and an arbitrary number of readers.  To
better exercise some of the code paths in RCU implementations, add fake
writer threads which call the synchronize function for the RCU variant in a
loop, with a delay between calls to arrange for different numbers of
writers running in parallel.

[bunk@stusta.de: cleanup]
Acked-by: Paul McKenney <paulmck@us.ibm.com>
Cc: Dipkanar Sarma <dipankar@in.ibm.com>
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
75cfef32f2 [PATCH] rcu: Fix sign bug making rcu_random always return the same sequence
rcu_random uses a counter rrs_count to occasionally mix data from
get_random_bytes into the state of its pseudorandom generator.  However,
the rrs_counter gets declared as an unsigned long, and rcu_random checks
for --rrs_count < 0, so this code will never mix any real random data into
the state, and will thus always return the same sequence of random numbers.

Also, change the return value of rcu_random from long to unsigned long, to
avoid potential issues caused by the use of the % operator, which can
return negative values for negative left operands.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
2860aaba4d [PATCH] rcu: Avoid kthread_stop on invalid pointer if rcutorture reader startup fails
rcu_torture_init kmallocs the array of reader threads, then creates each
one with kthread_run, cleaning up with rcu_torture_cleanup if this fails.
rcu_torture_cleanup calls kthread_stop on any non-NULL pointer in the
array; however, any readers after the one that failed to start up will have
invalid pointers, not null pointers.  Avoid this by using kzalloc instead.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Josh Triplett
3c29e03d91 [PATCH] rcu: Mention rcu_bh in description of rcutorture's torture_type parameter
The comment for rcutorture's torture_type parameter only lists the RCU
variants rcu and srcu, but not rcu_bh; add rcu_bh to the list.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Josh Triplett
ff2c93a537 [PATCH] rcu: Add MODULE_AUTHOR to rcutorture module
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Alan Stern
e6a92013ba [PATCH] SRCU: report out-of-memory errors
Currently the init_srcu_struct() routine has no way to report out-of-memory
errors.  This patch (as761) makes it return -ENOMEM when the per-cpu data
allocation fails.

The patch also makes srcu_init_notifier_head() report a BUG if a notifier
head can't be initialized.  Perhaps it should return -ENOMEM instead, but
in the most likely cases where this might occur I don't think any recovery
is possible.  Notifier chains generally are not created dynamically.

[akpm@osdl.org: avoid statement-with-side-effect in macro]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Alan Stern
eabc069401 [PATCH] Add SRCU-based notifier chains
This patch (as751) adds a new type of notifier chain, based on the SRCU
(Sleepable Read-Copy Update) primitives recently added to the kernel.  An
SRCU notifier chain is much like a blocking notifier chain, in that it must
be called in process context and its callout routines are allowed to sleep.
 The difference is that the chain's links are protected by the SRCU
mechanism rather than by an rw-semaphore, so calling the chain has
extremely low overhead: no memory barriers and no cache-line bouncing.  On
the other hand, unregistering from the chain is expensive and the chain
head requires special runtime initialization (plus cleanup if it is to be
deallocated).

SRCU notifiers are appropriate for notifiers that will be called very
frequently and for which unregistration occurs very seldom.  The proposed
"task notifier" scheme qualifies, as may some of the network notifiers.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Acked-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Paul E. McKenney
b2896d2e75 [PATCH] srcu-3: add SRCU operations to rcutorture
Adds SRCU operations to rcutorture and updates rcutorture documentation.
Also increases the stress imposed by the rcutorture test.

[bunk@stusta.de: make needlessly global code static]
Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Paul E. McKenney
621934ee7e [PATCH] srcu-3: RCU variant permitting read-side blocking
Updated patch adding a variant of RCU that permits sleeping in read-side
critical sections.  SRCU is as follows:

o	Each use of SRCU creates its own srcu_struct, and each
	srcu_struct has its own set of grace periods.  This is
	critical, as it prevents one subsystem with a blocking
	reader from holding up SRCU grace periods for other
	subsystems.

o	The SRCU primitives (srcu_read_lock(), srcu_read_unlock(),
	and synchronize_srcu()) all take a pointer to a srcu_struct.

o	The SRCU primitives must be called from process context.

o	srcu_read_lock() returns an int that must be passed to
	the matching srcu_read_unlock().  Realtime RCU avoids the
	need for this by storing the state in the task struct,
	but SRCU needs to allow a given code path to pass through
	multiple SRCU domains -- storing state in the task struct
	would therefore require either arbitrary space in the
	task struct or arbitrary limits on SRCU nesting.  So I
	kicked the state-storage problem up to the caller.

	Of course, it is not permitted to call synchronize_srcu()
	while in an SRCU read-side critical section.

o	There is no call_srcu().  It would not be hard to implement
	one, but it seems like too easy a way to OOM the system.
	(Hey, we have enough trouble with call_rcu(), which does
	-not- permit readers to sleep!!!)  So, if you want it,
	please tell me why...

[josht@us.ibm.com: sparse notation]
Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Eric W. Biederman
1f80025e62 [PATCH] msi: simplify msi sanity checks by adding with generic irq code
Currently msi.c is doing sanity checks that make certain before an irq is
destroyed it has no more users.

By adding irq_has_action I can perform the test is a generic way, instead of
relying on a msi specific data structure.

By performing the core check in dynamic_irq_cleanup I ensure every user of
dynamic irqs has a test present and we don't free resources that are in use.

In msi.c this allows me to kill the attrib.state member of msi_desc and all of
the assciated code to maintain it.

To keep from freeing data structures when irq cleanup code is called to soon
changing dyanamic_irq_cleanup is insufficient because there are msi specific
data structures that are also not safe to free.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg KH <greg@kroah.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:29 -07:00
Eric W. Biederman
3a16d71362 [PATCH] genirq: irq: add a dynamic irq creation API
With the msi support comes a new concept in irq handling, irqs that are
created dynamically at run time.

Currently the msi code allocates irqs backwards.  First it allocates a
platform dependent routing value for an interrupt the ``vector'' and then it
figures out from the vector which irq you are on.

This msi backwards allocator suffers from two basic problems.  The allocator
suffers because it is trying to do something that is architecture specific in
a generic way making it brittle, inflexible, and tied to tightly to the
architecture implementation.  The alloctor also suffers from it's very
backwards nature as it has tied things together that should have no
dependencies.

To solve the basic dynamic irq allocation problem two new architecture
specific functions are added: create_irq and destroy_irq.

create_irq takes no input and returns an unused irq number, that won't be
reused until it is returned to the free poll with destroy_irq.  The irq then
can be used for any purpose although the only initial consumer is the msi
code.

destroy_irq takes an irq number allocated with create_irq and returns it to
the free pool.

Making this functionality per architecture increases the simplicity of the irq
allocation code and increases it's flexibility.

dynamic_irq_init() and dynamic_irq_cleanup() are added to automate the
irq_desc initializtion that should happen for dynamic irqs.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rajesh Shah <rajesh.shah@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@UNISYS.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:27 -07:00
Eric W. Biederman
e7b946e98a [PATCH] genirq: irq: add moved_masked_irq
Currently move_native_irq disables and renables the irq we are migrating to
ensure we don't take that irq when we are actually doing the migration
operation.  Disabling the irq needs to happen but sometimes doing the work is
move_native_irq is too late.

On x86 with ioapics the irq move sequences needs to be:
edge_triggered:
  mask irq.
  move irq.
  unmask irq.
  ack irq.
level_triggered:
  mask irq.
  ack irq.
  move irq.
  unmask irq.

We can easily perform the edge triggered sequence, with the current defintion
of move_native_irq.  However the level triggered case does not map well.  For
that I have added move_masked_irq, to allow me to disable the irqs around both
the ack and the move.

Q: Why have we not seen this problem earlier?

A: The only symptom I have been able to reproduce is that if we change
   the vector before acknowleding an irq the wrong irq is acknowledged.
   Since we currently are not reprogramming the irq vector during
   migration no problems show up.

   We have to mask the irq before we acknowledge the irq or else we could
   hit a window where an irq is asserted just before we acknowledge it.

   Edge triggered irqs do not have this problem because acknowledgements
   do not propogate in the same way.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rajesh Shah <rajesh.shah@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@UNISYS.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:26 -07:00
Eric W. Biederman
a24ceab4f4 [PATCH] genirq: irq: convert the move_irq flag from a 32bit word to a single bit
The primary aim of this patchset is to remove maintenances problems caused by
the irq infrastructure.  The two big issues I address are an artificially
small cap on the number of irqs, and that MSI assumes vector == irq.  My
primary focus is on x86_64 but I have touched other architectures where
necessary to keep them from breaking.

- To increase the number of irqs I modify the code to look at the (cpu,
  vector) pair instead of just looking at the vector.

  With a large number of irqs available systems with a large irq count no
  longer need to compress their irq numbers to fit.  Removing a lot of brittle
  special cases.

  For acpi guys the result is that irq == gsi.

- Addressing the fact that MSI assumes irq == vector takes a few more
  patches.  But suffice it to say when I am done none of the generic irq code
  even knows what a vector is.

In quick testing on a large Unisys x86_64 machine we stumbled over at least
one driver that assumed that NR_IRQS could always fit into an 8 bit number.
This driver is clearly buggy today.  But this has become a class of bugs that
it is now much easier to hit.

This patch:

This is a minor space optimization.  In practice I don't think this has any
affect because of our alignment constraints and the other fields but there is
not point in chewing up an uncessary word and since we already read the flag
field this should improve the cache hit ratio of the irq handler.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rajesh Shah <rajesh.shah@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@UNISYS.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:26 -07:00
Steve Grubb
ac9910ce01 [PATCH] name_count array overrun
Hi,

This patch removes the rdev logging from the previous patch

The below patch closes an unbounded use of name_count. This can lead to oopses
in some new file systems.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-10-04 08:31:21 -04:00
Alexander Viro
419c58f11f [PATCH] PPID filtering fix
On Thu, Sep 28, 2006 at 04:03:06PM -0400, Eric Paris wrote:
> After some looking I did not see a way to get into audit_log_exit
> without having set the ppid.  So I am dropping the set from there and
> only doing it at the beginning.
>
> Please comment/ack/nak as soon as possible.

Ehh...  That's one hell of an overhead to be had ;-/  Let's be lazy.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-10-04 08:31:19 -04:00
Eric Paris
4b8a311bb1 [PATCH] arch filter lists with < or > should not be accepted
Currently the kernel audit system represents arch's as numbers and will
gladly accept comparisons between archs using >, <, >=, <= when the only
thing that makes sense is = or !=.  I'm told that the next revision of
auditctl will do this checking but this will provide enforcement in the
kernel even for old userspace.  A simple command to show the issue would
be to run

auditctl -d entry,always -F arch>i686 -S chmod

with this patch the kernel will reject this with -EINVAL

Please comment/ack/nak as soon as possible.

-Eric

 kernel/auditfilter.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-10-04 08:31:16 -04:00
Dave Jones
038b0a6d8d Remove all inclusions of <linux/config.h>
kbuild explicitly includes this at build time.

Signed-off-by: Dave Jones <davej@redhat.com>
2006-10-04 03:38:54 -04:00
Josh Triplett
4802211cfd rcutorture: Fix incorrect description of default for nreaders parameter
The comment for the nreaders parameter of rcutorture gives the default as
4*ncpus, but the value actually defaults to 2*ncpus; fix the comment.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 23:26:16 +02:00
Rolf Eike Beer
9f5d785e93 remove duplicate "until" from kernel/workqueue.c
s/until until/until/

Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 23:07:31 +02:00
Uwe Zeisberger
f30c226954 fix file specification in comments
Many files include the filename at the beginning, serveral used a wrong one.

Signed-off-by: Uwe Zeisberger <Uwe_Zeisberger@digi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 23:01:26 +02:00
Christoph Lameter
ce164428c4 [PATCH] scheduler: NUMA aware placement of sched_group_allnodes
When the per cpu sched domains are build then they also need to be placed
on the node where the cpu resides otherwise we will have frequent off node
accesses which will slow down the system.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:07 -07:00
Satoru Takeuchi
0feaece977 [PATCH] sched: fixing wrong comment for find_idlest_cpu()
Fixing wrong comment for find_idlest_cpu().

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:07 -07:00
Siddha, Suresh B
89c4710ee9 [PATCH] sched: cleanup sched_group cpu_power setup
Up to now sched group's cpu_power for each sched domain is initialized
independently.  This made the setup code ugly as the new sched domains are
getting added.

Make the sched group cpu_power setup code generic, by using domain child
field and new domain flag in sched_domain.  For most of the sched
domains(except NUMA), sched group's cpu_power is now computed generically
using the domain properties of itself and of the child domain.

sched groups in NUMA domains are setup little differently and hence they
don't use this generic mechanism.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:06 -07:00
Siddha, Suresh B
1a84887080 [PATCH] sched: introduce child field in sched_domain
Introduce the child field in sched_domain struct and use it in
sched_balance_self().

We will also use this field in cleaning up the sched group cpu_power
setup(done in a different patch) code.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:06 -07:00
Dave Jones
7473264643 [PATCH] sched: don't print migration cost when only 1 CPU
If only a single CPU is present, printing this doesn't make much sense.

Signed-off-by: Dave Jones <davej@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:06 -07:00
Siddha, Suresh B
a616058b78 [PATCH] sched: remove unnecessary sched group allocations
Remove dynamic sched group allocations for MC and SMP domains.  These
allocations can easily fail on big systems(1024 or so CPUs) and we can live
with out these dynamic allocations.

[akpm@osdl.org: build fix]
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:06 -07:00
Nick Piggin
5c1e176781 [PATCH] sched: force /sbin/init off isolated cpus
Force /sbin/init off isolated cpus (unless every CPU is specified as an
isolcpu).

Users seem to think that the isolated CPUs shouldn't have much running on
them to begin with.  That's fair enough: intuitive, I guess.  It also means
that the cpu affinity masks of tasks will not include isolcpus by default,
which is also more intuitive, perhaps.

/sbin/init is spawned from the boot CPU's idle thread, and /sbin/init
starts the rest of userspace. So if the boot CPU is specified to be an
isolcpu, then prior to this patch, all of userspace will be run there.

(throw in a couple of plausible devinit -> cpuinit conversions I spotted
while we're here).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:06 -07:00
Randy Dunlap
e1ca66d1b9 [PATCH] kernel-doc for kernel/resource.c
Add kernel-doc function headers in kernel/resource.c and use them in DocBook.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:03:41 -07:00
Randy Dunlap
eed34d0fc5 [PATCH] kernel-doc for kernel/dma.c
Add kernel-doc function headers in kernel/dma.c and use it in DocBook.

Clean up kernel-doc in mca_dma.h (the colon (':') represents a
section header).

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:03:41 -07:00
Franck Bui-Huu
ffc5089196 [PATCH] Create kallsyms_lookup_size_offset()
Some uses of kallsyms_lookup() do not need to find out the name of a symbol
and its module's name it belongs.  This is specially true in arch specific
code, which needs to unwind the stack to show the back trace during oops
(mips is an example).  In this specific case, we just need to retreive the
function's size and the offset of the active intruction inside it.

Adds a new entry "kallsyms_lookup_size_offset()" This new entry does
exactly the same as kallsyms_lookup() but does not require any buffers to
store any names.

It returns 0 if it fails otherwise 1.

Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:03:41 -07:00
David Howells
3f2e05e90e [PATCH] BLOCK: Revert patch to hack around undeclared sigset_t in linux/compat.h
Revert Andrew Morton's patch to temporarily hack around the lack of a
declaration of sigset_t in linux/compat.h to make the block-disablement
patches build on IA64.  This got accidentally pushed to Linus and should
be fixed in a different manner.

Also make linux/compat.h #include asm/signal.h to gain a definition of
sigset_t so that it can externally declare sigset_from_compat().

This has been compile-tested for i386, x86_64, ia64, mips, mips64, frv, ppc and
ppc64 and run-tested on frv.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 08:03:31 -07:00
Cedric Le Goater
9ec52099e4 [PATCH] replace cad_pid by a struct pid
There are a few places in the kernel where the init task is signaled.  The
ctrl+alt+del sequence is one them.  It kills a task, usually init, using a
cached pid (cad_pid).

This patch replaces the pid_t by a struct pid to avoid pid wrap around
problem.  The struct pid is initialized at boot time in init() and can be
modified through systctl with

	/proc/sys/kernel/cad_pid

[ I haven't found any distro using it ? ]

It also introduces a small helper routine kill_cad_pid() which is used
where it seemed ok to use cad_pid instead of pid 1.

[akpm@osdl.org: cleanups, build fix]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:25 -07:00
Oleg Nesterov
1a657f78dc [PATCH] introduce get_task_pid() to fix unsafe get_pid()
proc_pid_make_inode:

	ei->pid = get_pid(task_pid(task));

I think this is not safe.  get_pid() can be preempted after checking "pid
!= NULL".  Then the task exits, does detach_pid(), and RCU frees the pid.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:25 -07:00
Arnd Bergmann
6760856791 [PATCH] introduce kernel_execve
The use of execve() in the kernel is dubious, since it relies on the
__KERNEL_SYSCALLS__ mechanism that stores the result in a global errno
variable.  As a first step of getting rid of this, change all users to a
global kernel_execve function that returns a proper error code.

This function is a terrible hack, and a later patch removes it again after the
kernel syscalls are gone.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:23 -07:00
Pavel
5d124e99c2 [PATCH] nsproxy cloning error path fix
This patch fixes copy_namespaces()'s error path.

when new nsproxy (new_ns) is created pointers to namespaces (ipc, uts) are
copied from the old nsproxy.  Later in copy_utsname, copy_ipcs, etc.
according namespaces are get-ed.  On error path needed namespaces are
put-ed, so there's no need to put new nsproxy itelf as it woud cause
putting namespaces for the second time.

Found when incorporating namespaces into OpenVZ kernel.

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:22 -07:00
Kirill Korotaev
fcfbd547b1 [PATCH] IPC namespace - sysctls
Sysctl tweaks for IPC namespace

Signed-off-by: Pavel Emelianiov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:22 -07:00
Kirill Korotaev
73ea41302b [PATCH] IPC namespace - utils
This patch adds basic IPC namespace functionality to
IPC utils:
- init_ipc_ns
- copy/clone/unshare/free IPC ns
- /proc preparations

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:22 -07:00
Kirill Korotaev
25b21cb2f6 [PATCH] IPC namespace core
This patch set allows to unshare IPCs and have a private set of IPC objects
(sem, shm, msg) inside namespace.  Basically, it is another building block of
containers functionality.

This patch implements core IPC namespace changes:
- ipc_namespace structure
- new config option CONFIG_IPC_NS
- adds CLONE_NEWIPC flag
- unshare support

[clg@fr.ibm.com: small fix for unshare of ipc namespace]
[akpm@osdl.org: build fix]
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:22 -07:00
Serge Hallyn
c0b2fc3165 [PATCH] uts: copy nsproxy only when needed
The nsproxy was being copied in unshare() when anything was being unshared,
even if it was something not referenced from nsproxy.  This should end up
in some cases with far more memory usage than necessary.

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:22 -07:00
Serge E. Hallyn
071df104f8 [PATCH] namespaces: utsname: implement CLONE_NEWUTS flag
Implement a CLONE_NEWUTS flag, and use it at clone and sys_unshare.

[clg@fr.ibm.com: IPC unshare fix]
[bunk@stusta.de: cleanup]
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:22 -07:00
Serge E. Hallyn
8218c74c02 [PATCH] namespaces: utsname: sysctl
Sysctl uts patch.  This will need to be done another way, but since sysctl
itself needs to be container aware, 'the right thing' is a separate patchset.

[akpm@osdl.org: ia64 build fix]
[sam.vilain@catalyst.net.nz: cleanup]
[sam.vilain@catalyst.net.nz: add proc_do_utsns_string]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:21 -07:00
Serge E. Hallyn
4865ecf131 [PATCH] namespaces: utsname: implement utsname namespaces
This patch defines the uts namespace and some manipulators.
Adds the uts namespace to task_struct, and initializes a
system-wide init namespace.

It leaves a #define for system_utsname so sysctl will compile.
This define will be removed in a separate patch.

[akpm@osdl.org: build fix, cleanup]
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:21 -07:00
Serge E. Hallyn
96b644bdec [PATCH] namespaces: utsname: use init_utsname when appropriate
In some places, particularly drivers and __init code, the init utsns is the
appropriate one to use.  This patch replaces those with a the init_utsname
helper.

Changes: Removed several uses of init_utsname().  Hope I picked all the
	right ones in net/ipv4/ipconfig.c.  These are now changed to
	utsname() (the per-process namespace utsname) in the previous
	patch (2/7)

[akpm@osdl.org: CIFS fix]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:21 -07:00
Serge E. Hallyn
e9ff3990f0 [PATCH] namespaces: utsname: switch to using uts namespaces
Replace references to system_utsname to the per-process uts namespace
where appropriate.  This includes things like uname.

Changes: Per Eric Biederman's comments, use the per-process uts namespace
	for ELF_PLATFORM, sunrpc, and parts of net/ipv4/ipconfig.c

[jdike@addtoit.com: UML fix]
[clg@fr.ibm.com: cleanup]
[akpm@osdl.org: build fix]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:21 -07:00
Cedric Le Goater
fab413a334 [PATCH] namespaces: exit_task_namespaces() invalidates nsproxy
exit_task_namespaces() has replaced the former exit_namespace().  It
invalidates task->nsproxy and associated namespaces.  This is an issue for
the (futur) pid namespace which is required to be valid in exit_notify().

This patch moves exit_task_namespaces() after exit_notify() to keep nsproxy
valid.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:21 -07:00
Serge E. Hallyn
1651e14e28 [PATCH] namespaces: incorporate fs namespace into nsproxy
This moves the mount namespace into the nsproxy.  The mount namespace count
now refers to the number of nsproxies point to it, rather than the number of
tasks.  As a result, the unshare_namespace() function in kernel/fork.c no
longer checks whether it is being shared.

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:20 -07:00
Serge E. Hallyn
0437eb594e [PATCH] nsproxy: move init_nsproxy into kernel/nsproxy.c
Move the init_nsproxy definition out of arch/ into kernel/nsproxy.c.  This
avoids all arches having to be updated.  Compiles and boots on s390.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:20 -07:00
Serge E. Hallyn
ab516013ad [PATCH] namespaces: add nsproxy
This patch adds a nsproxy structure to the task struct.  Later patches will
move the fs namespace pointer into this structure, and introduce a new utsname
namespace into the nsproxy.

The vserver and openvz functionality, then, would be implemented in large part
by virtualizing/isolating more and more resources into namespaces, each
contained in the nsproxy.

[akpm@osdl.org: build fix]
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:20 -07:00
Adrian Bunk
b1ba4ddde0 [PATCH] make kernel/sysctl.c:_proc_do_string() static
This patch makes the needlessly global _proc_do_string() static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:20 -07:00
Sam Vilain
f5dd3d6fad [PATCH] proc: sysctl: add _proc_do_string helper
The logic in proc_do_string is worth re-using without passing in a
ctl_table structure (say, we want to calculate a pointer and pass that in
instead); pass in the two fields it uses from that structure as explicit
arguments.

Signed-off-by: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:20 -07:00
Greg Banks
e16b38f713 [PATCH] cpumask: export cpu_online_map and cpu_possible_map consistently
cpumask: ensure that the cpu_online_map and cpu_possible_map bitmasks, and
hence all the macros in <linux/cpumask.h> that require them, are available to
modules for all supported combinations of architecture and CONFIG_SMP.

Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:17 -07:00
bibo,mao
99219a3fbc [PATCH] kretprobe spinlock deadlock patch
kprobe_flush_task() possibly calls kfree function during holding
kretprobe_lock spinlock, if kfree function is probed by kretprobe that will
incur spinlock deadlock.  This patch moves kfree function out scope of
kretprobe_lock.

Signed-off-by: bibo, mao <bibo.mao@intel.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:16 -07:00
bibo,mao
f2aa85a0cc [PATCH] disallow kprobes on notifier_call_chain
When kprobe is re-entered, the re-entered kprobe kernel path will will call
atomic_notifier_call_chain function, if this function is kprobed that will
incur numerous kprobe recursive fault.  This patch disallows kprobes on
atomic_notifier_call_chain function.

Signed-off-by: bibo, mao <bibo.mao@intel.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:16 -07:00
bibo,mao
62c27be0dd [PATCH] kprobe whitespace cleanup
Whitespace is used to indent, this patch cleans up these sentences by
kernel coding style.

Signed-off-by: bibo, mao <bibo.mao@intel.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:16 -07:00
Ananth N Mavinakayanahalli
3a872d89ba [PATCH] Kprobes: Make kprobe modules more portable
In an effort to make kprobe modules more portable, here is a patch that:

o Introduces the "symbol_name" field to struct kprobe.
  The symbol->address resolution now happens in the kernel in an
  architecture agnostic manner. 64-bit powerpc users no longer have
  to specify the ".symbols"
o Introduces the "offset" field to struct kprobe to allow a user to
  specify an offset into a symbol.
o The legacy mechanism of specifying the kprobe.addr is still supported.
  However, if both the kprobe.addr and kprobe.symbol_name are specified,
  probe registration fails with an -EINVAL.
o The symbol resolution code uses kallsyms_lookup_name(). So
  CONFIG_KPROBES now depends on CONFIG_KALLSYMS
o Apparantly kprobe modules were the only legitimate out-of-tree user of
  the kallsyms_lookup_name() EXPORT. Now that the symbol resolution
  happens in-kernel, remove the EXPORT as suggested by Christoph Hellwig
o Modify tcp_probe.c that uses the kprobe interface so as to make it
  work on multiple platforms (in its earlier form, the code wouldn't
  work, say, on powerpc)

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:16 -07:00
Eric W. Biederman
2425c08b37 [PATCH] usb: fixup usb so it uses struct pid
The problem with remembering a user space process by its pid is that it is
possible that the process will exit, pid wrap around will occur.
Converting to a struct pid avoid that problem, and paves the way for
implementing a pid namespace.

Also since usb is the only user of kill_proc_info_as_uid rename
kill_proc_info_as_uid to kill_pid_info_as_uid and have the new version take
a struct pid.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:15 -07:00
Eric W. Biederman
f40f50d3bb [PATCH] Use struct pspace in next_pidmap and find_ge_pid
This updates my proc: readdir race fix (take 3) patch
to account for the changes made by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
to introduce struct pspace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:15 -07:00
Sukadev Bhattiprolu
3fbc964864 [PATCH] Define struct pspace
Define a per-container pid space object.  And create one instance of this
object, init_pspace, to define the entire pid space.  Subsequent patches
will provide/use interfaces to create/destroy pid spaces.

Its a subset/rework of Eric Biederman's patch
http://lkml.org/lkml/2006/2/6/285 .

Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:15 -07:00
Sukadev Bhattiprolu
aa5a6662f9 [PATCH] Move pidmap to pspace.h
Move struct pidmap and PIDMAP_ENTRIES to a new file, include/linux/pspace.h
where it will be used in subsequent patches to define pid spaces.

Its a subset of Eric Biederman's patch http://lkml.org/lkml/2006/2/6/285

[akpm@osdl.org: cleanups]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:15 -07:00
Eric W. Biederman
c88be3eb2e [PATCH] pids coding style use struct pidmap in next_pidmap
Use struct pidmap instead of pidmap_t.

This updates my proc: readdir race fix (take 3) patch
to account for the changes made by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
to kill pidmap_t.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:15 -07:00
Sukadev Bhattiprolu
6a1f3b8455 [PATCH] pids: coding style: use struct pidmap
Use struct pidmap instead of pidmap_t.

Its a subset of Eric Biederman's patch http://lkml.org/lkml/2006/2/6/271.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:14 -07:00
Eric W. Biederman
609d7fa956 [PATCH] file: modify struct fown_struct to use a struct pid
File handles can be requested to send sigio and sigurg to processes.  By
tracking the destination processes using struct pid instead of pid_t we make
the interface safe from all potential pid wrap around problems.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:14 -07:00
Eric W. Biederman
bbf73147e2 [PATCH] pid: export the symbols needed to use struct pid *
pids aren't something that drivers should care about.  However there are a lot
of helper layers in the kernel that do care, and are built as modules.  Before
I can convert them to using struct pid instead of pid_t I need to export the
appropriate symbols so they can continue to be built.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:13 -07:00
Eric W. Biederman
c4b92fc112 [PATCH] pid: implement signal functions that take a struct pid *
Currently the signal functions all either take a task or a pid_t argument.
This patch implements variants that take a struct pid *.  After all of the
users have been update it is my intention to remove the variants that take a
pid_t as using pid_t can be more work (an extra hash table lookup) and
difficult to get right in the presence of multiple pid namespaces.

There are two kinds of functions introduced in this patch.  The are the
general use functions kill_pgrp and kill_pid which take a priv argument that
is ultimately used to create the appropriate siginfo information, Then there
are _kill_pgrp_info, kill_pgrp_info, kill_pid_info the internal implementation
helpers that take an explicit siginfo.

The distinction is made because filling out an explcit siginfo is tricky, and
will be even more tricky when pid namespaces are introduced.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:13 -07:00
Eric W. Biederman
0804ef4b0d [PATCH] proc: readdir race fix (take 3)
The problem: An opendir, readdir, closedir sequence can fail to report
process ids that are continually in use throughout the sequence of system
calls.  For this race to trigger the process that proc_pid_readdir stops at
must exit before readdir is called again.

This can cause ps to fail to report processes, and it is in violation of
posix guarantees and normal application expectations with respect to
readdir.

Currently there is no way to work around this problem in user space short
of providing a gargantuan buffer to user space so the directory read all
happens in on system call.

This patch implements the normal directory semantics for proc, that
guarantee that a directory entry that is neither created nor destroyed
while reading the directory entry will be returned.  For directory that are
either created or destroyed during the readdir you may or may not see them.
 Furthermore you may seek to a directory offset you have previously seen.

These are the guarantee that ext[23] provides and that posix requires, and
more importantly that user space expects.  Plus it is a simple semantic to
implement reliable service.  It is just a matter of calling readdir a
second time if you are wondering if something new has show up.

These better semantics are implemented by scanning through the pids in
numerical order and by making the file offset a pid plus a fixed offset.

The pid scan happens on the pid bitmap, which when you look at it is
remarkably efficient for a brute force algorithm.  Given that a typical
cache line is 64 bytes and thus covers space for 64*8 == 200 pids.  There
are only 40 cache lines for the entire 32K pid space.  A typical system
will have 100 pids or more so this is actually fewer cache lines we have to
look at to scan a linked list, and the worst case of having to scan the
entire pid bitmap is pretty reasonable.

If we need something more efficient we can go to a more efficient data
structure for indexing the pids, but for now what we have should be
sufficient.

In addition this takes no additional locks and is actually less code than
what we are doing now.

Also another very subtle bug in this area has been fixed.  It is possible
to catch a task in the middle of de_thread where a thread is assuming the
thread of it's thread group leader.  This patch carefully handles that case
so if we hit it we don't fail to return the pid, that is undergoing the
de_thread dance.

Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for
providing the first fix, pointing this out and working on it.

[oleg@tv-sign.ru: fix it]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:12 -07:00
Randy Dunlap
2bc2d61a96 [PATCH] list module taint flags in Oops/panic
When listing loaded modules during an oops or panic, also list each
module's Tainted flags if non-zero (P: Proprietary or F: Forced load only).

If a module is did not taint the kernel, it is just listed like
	usbcore
but if it did taint the kernel, it is listed like
	wizmodem(PF)

Example:
[ 3260.121718] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[ 3260.121729]  [<ffffffff8804c099>] :dump_test:proc_dump_test+0x99/0xc8
[ 3260.121742] PGD fe8d067 PUD 264a6067 PMD 0
[ 3260.121748] Oops: 0002 [1] SMP
[ 3260.121753] CPU 1
[ 3260.121756] Modules linked in: dump_test(P) snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ide_cd generic ohci1394 snd_hda_intel snd_hda_codec snd_pcm snd_timer snd ieee1394 snd_page_alloc piix ide_core arcmsr aic79xx scsi_transport_spi usblp
[ 3260.121785] Pid: 5556, comm: bash Tainted: P      2.6.18-git10 #1

[Alternatively, I can look into listing tainted flags with 'lsmod',
but that won't help in oopsen/panics so much.]

[akpm@osdl.org: cleanup]
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:12 -07:00
Andi Kleen
d025c9db7f [PATCH] Support piping into commands in /proc/sys/kernel/core_pattern
Using the infrastructure created in previous patches implement support to
pipe core dumps into programs.

This is done by overloading the existing core_pattern sysctl
with a new syntax:

|program

When the first character of the pattern is a '|' the kernel will instead
threat the rest of the pattern as a command to run.  The core dump will be
written to the standard input of that program instead of to a file.

This is useful for having automatic core dump analysis without filling up
disks.  The program can do some simple analysis and save only a summary of
the core dump.

The core dump proces will run with the privileges and in the name space of
the process that caused the core dump.

I also increased the core pattern size to 128 bytes so that longer command
lines fit.

Most of the changes comes from allowing core dumps without seeks.  They are
fairly straight forward though.

One small incompatibility is that if someone had a core pattern previously
that started with '|' they will get suddenly new behaviour.  I think that's
unlikely to be a real problem though.

Additional background:

> Very nice, do you happen to have a program that can accept this kind of
> input for crash dumps?  I'm guessing that the embedded people will
> really want this functionality.

I had a cheesy demo/prototype.  Basically it wrote the dump to a file again,
ran gdb on it to get a backtrace and wrote the summary to a shared directory.
Then there was a simple CGI script to generate a "top 10" crashes HTML
listing.

Unfortunately this still had the disadvantage to needing full disk space for a
dump except for deleting it afterwards (in fact it was worse because over the
pipe holes didn't work so if you have a holey address map it would require
more space).

Fortunately gdb seems to be happy to handle /proc/pid/fd/xxx input pipes as
cores (at least it worked with zsh's =(cat core) syntax), so it would be
likely possible to do it without temporary space with a simple wrapper that
calls it in the right way.  I ran out of time before doing that though.

The demo prototype scripts weren't very good.  If there is really interest I
can dig them out (they are currently on a laptop disk on the desk with the
laptop itself being in service), but I would recommend to rewrite them for any
serious application of this and fix the disk space problem.

Also to be really useful it should probably find a way to automatically fetch
the debuginfos (I cheated and just installed them in advance).  If nobody else
does it I can probably do the rewrite myself again at some point.

My hope at some point was that desktops would support it in their builtin
crash reporters, but at least the KDE people I talked too seemed to be happy
with their user space only solution.

Alan sayeth:

  I don't believe that piping as such as neccessarily the right model, but
  the ability to intercept and processes core dumps from user space is asked
  for by many enterprise users as well.  They want to know about, capture,
  analyse and process core dumps, often centrally and in automated form.

[akpm@osdl.org: loff_t != unsigned long]
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:33 -07:00
Andi Kleen
e239ca5405 [PATCH] Create call_usermodehelper_pipe()
A new member in the ever growing family of call_usermode* functions is
born.  The new call_usermodehelper_pipe() function allows to pipe data to
the stdin of the called user mode progam and behaves otherwise like the
normal call_usermodehelp() (except that it always waits for the child to
finish)

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:33 -07:00
Dave Hansen
d8c76e6f45 [PATCH] r/o bind mount prepwork: inc_nlink() helper
This is mostly included for parity with dec_nlink(), where we will have some
more hooks.  This one should stay pretty darn straightforward for now.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:30 -07:00
Jay Lan
db5fed26b2 [PATCH] csa accounting taskstats update
ChangeLog:
   Feedbacks from Andrew Morton:
   - define TS_COMM_LEN to 32
   - change acct_stimexpd field of task_struct to be of
     cputime_t, which is to be used to save the tsk->stime
     of last timer interrupt update.
   - a new Documentation/accounting/taskstats-struct.txt
     to describe fields of taskstats struct.

   Feedback from Balbir Singh:
   - keep the stime of a task to be zero when both stime
     and utime are zero as recoreded in task_struct.

   Misc:
   - convert accumulated RSS/VM from platform dependent
     pages-ticks to MBytes-usecs in the kernel

Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Jay Lan
8f0ab51479 [PATCH] csa: convert CONFIG tag for extended accounting routines
There were a few accounting data/macros that are used in CSA but are #ifdef'ed
inside CONFIG_BSD_PROCESS_ACCT.  This patch is to change those ifdef's from
CONFIG_BSD_PROCESS_ACCT to CONFIG_TASK_XACCT.  A few defines are moved from
kernel/acct.c and include/linux/acct.h to kernel/tsacct.c and
include/linux/tsacct_kern.h.

Signed-off-by: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Jay Lan
9acc185351 [PATCH] csa: Extended system accounting over taskstats
Add extended system accounting handling over taskstats interface.  A
CONFIG_TASK_XACCT flag is created to enable the extended accounting code.

Signed-off-by: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Jay Lan
f3cef7a994 [PATCH] csa: basic accounting over taskstats
Add some basic accounting fields to the taskstats struct, add a new
kernel/tsacct.c to handle basic accounting data handling upon exit.  A handle
is added to taskstats.c to invoke the basic accounting data handling.

Signed-off-by: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: "Michal Piotrowski" <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Balbir Singh
0ae646845b [PATCH] Fix taskstats size calculation (use the new genetlink utility functions)
The addition of the CSA patch pushed the size of struct taskstats to 256
bytes.  This exposed a problem with prepare_reply(), we were not allocating
space for the netlink and genetlink header.  It worked earlier because
alloc_skb() would align the skb to SMP_CACHE_BYTES, which added some additonal
bytes.

Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jamal Hadi <hadi@cyberus.ca>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Atsushi Nemoto
8ef386092d [PATCH] kill wall_jiffies
With 2.6.18-rc4-mm2, now wall_jiffies will always be the same as jiffies.
So we can kill wall_jiffies completely.

This is just a cleanup and logically should not change any real behavior
except for one thing: RTC updating code in (old) ppc and xtensa use a
condition "jiffies - wall_jiffies == 1".  This condition is never met so I
suppose it is just a bug.  I just remove that condition only instead of
kill the whole "if" block.

[heiko.carstens@de.ibm.com: s390 build fix and cleanup]
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:27 -07:00
Adrian Bunk
70bc42f90a [PATCH] kernel/time/ntp.c: possible cleanups
This patch contains the following possible cleanups:
- make the following needlessly global function static:
  - ntp_update_frequency()
- make the following needlessly global variables static:
  - time_state
  - time_offset
  - time_constant
  - time_reftime
- remove the following read-only global variable:
  - time_precision

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:27 -07:00
Roman Zippel
f199239373 [PATCH] ntp: convert to the NTP4 reference model
This converts the kernel ntp model into a model which matches the nanokernel
reference implementations.  The previous patches already increased the
resolution and precision of the computations, so that this conversion becomes
quite simple.

<linux@horizon.com> explains:

The original NTP kernel interface was defined in units of microseconds.
That's what Linux implements.  As computers have gotten faster and can now
split microseconds easily, a new kernel interface using nanosecond units was
defined ("the nanokernel", confusing as that name is to OS hackers), and
there's an STA_NANO bit in the adjtimex() status field to tell the application
which units it's using.

The current ntpd supports both, but Linux loses some possible timing
resolution because of quantization effects, and the ntpd hackers would really
like to be able to drop the backwards compatibility code.

Ulrich Windl has been maintaining a patch set to do the conversion for years,
but it's hard to keep in sync.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:27 -07:00
Roman Zippel
04b617e71e [PATCH] ntp: convert time_freq to nsec value
This converts time_freq to a scaled nsec value and adds around 6bit of extra
resolution.  This pushes the time_freq to its 32bit limits so the calculatons
have to be done with 64bit.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:27 -07:00
Roman Zippel
97eebe138c [PATCH] ntp: remove time_tolerance
time_tolerance isn't changed at all in the kernel, so simply remove it, this
simplifies the next patch, as it avoids a number of conversions.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel
8f807f8d21 [PATCH] ntp: add time_adjust to tick length
This folds update_ntp_one_tick() into second_overflow() and adds time_adjust
to the tick length, this makes time_next_adjust unnecessary.  This slightly
changes the adjtime() behaviour, instead of applying it to the next tick, it's
applied to the next second.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel
3d3675cc3d [PATCH] ntp: prescale time_offset
This converts time_offset into a scaled per tick value.  This avoids now
completely the crude compensation in second_overflow().

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel
dc6a43e46f [PATCH] ntp: add time_freq to tick length
This adds the frequency part to ntp_update_frequency().

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel
ab8783b688 [PATCH] ntp: add time_adj to tick length
This makes time_adj local to second_overflow() and integrates it into the tick
length instead of adding it everytime.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel
b0ee75561b [PATCH] ntp: add ntp_update_frequency
This introduces ntp_update_frequency() and deinlines ntp_clear() (as it's not
performance critical).  ntp_update_frequency() calculates the base tick length
using tick_usec and adds a base adjustment, in case the frequency doesn't
divide evenly by HZ.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
john stultz
4c7ee8de95 [PATCH] NTP: Move all the NTP related code to ntp.c
Move all the NTP related code to ntp.c

[akpm@osdl.org: cleanups, build fix]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Martin Schwidefsky
ef6edc9746 [PATCH] Directed yield: cpu_relax variants for spinlocks and rw-locks
On systems running with virtual cpus there is optimization potential in
regard to spinlocks and rw-locks.  If the virtual cpu that has taken a lock
is known to a cpu that wants to acquire the same lock it is beneficial to
yield the timeslice of the virtual cpu in favour of the cpu that has the
lock (directed yield).

With CONFIG_PREEMPT="n" this can be implemented by the architecture without
common code changes.  Powerpc already does this.

With CONFIG_PREEMPT="y" the lock loops are coded with _raw_spin_trylock,
_raw_read_trylock and _raw_write_trylock in kernel/spinlock.c.  If the lock
could not be taken cpu_relax is called.  A directed yield is not possible
because cpu_relax doesn't know anything about the lock.  To be able to
yield the lock in favour of the current lock holder variants of cpu_relax
for spinlocks and rw-locks are needed.  The new _raw_spin_relax,
_raw_read_relax and _raw_write_relax primitives differ from cpu_relax
insofar that they have an argument: a pointer to the lock structure.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:21 -07:00
Cal Peake
756184b7d7 [PATCH] CodingStyle cleanup for kernel/sys.c
Fix up kernel/sys.c to be consistent with CodingStyle and the rest of the
file.

Signed-off-by: Cal Peake <cp@absolutedigital.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:20 -07:00
Arjan van de Ven
5c87579e65 [PATCH] maximum latency tracking infrastructure
Add infrastructure to track "maximum allowable latency" for power saving
policies.

The reason for adding this infrastructure is that power management in the
idle loop needs to make a tradeoff between latency and power savings
(deeper power save modes have a longer latency to running code again).  The
code that today makes this tradeoff just does a rather simple algorithm;
however this is not good enough: There are devices and use cases where a
lower latency is required than that the higher power saving states provide.
 An example would be audio playback, but another example is the ipw2100
wireless driver that right now has a very direct and ugly acpi hook to
disable some higher power states randomly when it gets certain types of
error.

The proposed solution is to have an interface where drivers can

* announce the maximum latency (in microseconds) that they can deal with
* modify this latency
* give up their constraint

and a function where the code that decides on power saving strategy can
query the current global desired maximum.

This patch has a user of each side: on the consumer side, ACPI is patched
to use this, on the producer side the ipw2100 driver is patched.

A generic maximum latency is also registered of 2 timer ticks (more and you
lose accurate time tracking after all).

While the existing users of the patch are x86 specific, the infrastructure
is not.  I'd like to ask the arch maintainers of other architectures if the
infrastructure is generic enough for their use (assuming the architecture
has such a tradeoff as concept at all), and the sound/multimedia driver
owners to look at the driver facing API to see if this is something they
can use.

[akpm@osdl.org: cleanups]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jesse Barnes <jesse.barnes@intel.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:19 -07:00
David Howells
9361401eb7 [PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer.  Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.

This patch does the following:

 (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
     support.

 (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
     an item that uses the block layer.  This includes:

     (*) Block I/O tracing.

     (*) Disk partition code.

     (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

     (*) The SCSI layer.  As far as I can tell, even SCSI chardevs use the
     	 block layer to do scheduling.  Some drivers that use SCSI facilities -
     	 such as USB storage - end up disabled indirectly from this.

     (*) Various block-based device drivers, such as IDE and the old CDROM
     	 drivers.

     (*) MTD blockdev handling and FTL.

     (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
     	 taking a leaf out of JFFS2's book.

 (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
     linux/elevator.h contingent on CONFIG_BLOCK being set.  sector_div() is,
     however, still used in places, and so is still available.

 (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
     parts of linux/fs.h.

 (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

 (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

 (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
     is not enabled.

 (*) fs/no-block.c is created to hold out-of-line stubs and things that are
     required when CONFIG_BLOCK is not set:

     (*) Default blockdev file operations (to give error ENODEV on opening).

 (*) Makes some /proc changes:

     (*) /proc/devices does not list any blockdevs.

     (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

 (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

 (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
     given command other than Q_SYNC or if a special device is specified.

 (*) In init/do_mounts.c, no reference is made to the blockdev routines if
     CONFIG_BLOCK is not defined.  This does not prohibit NFS roots or JFFS2.

 (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
     error ENOSYS by way of cond_syscall if so).

 (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
     CONFIG_BLOCK is not set, since they can't then happen.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:31 +02:00
David Howells
07f3f05c1e [PATCH] BLOCK: Move extern declarations out of fs/*.c into header files [try #6]
Create a new header file, fs/internal.h, for common definitions local to the
sources in the fs/ directory.

Move extern definitions that should be in header files from fs/*.c to
fs/internal.h or other main header files where they span directories.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:18 +02:00
David Howells
0d67a46df0 [PATCH] BLOCK: Remove duplicate declaration of exit_io_context() [try #6]
Remove the duplicate declaration of exit_io_context() from linux/sched.h.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:31:20 +02:00
Andi Kleen
34596dc9e5 [PATCH] Define vsyscall cache as blob to make clearer that user space shouldn't use it
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-30 01:47:55 +02:00
Andi Kleen
29cbc78b90 [PATCH] x86: Clean up x86 NMI sysctls
Use prototypes in headers
Don't define panic_on_unrecovered_nmi for all architectures

Cc: dzickus@redhat.com

Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-30 01:47:55 +02:00
Paul Jackson
181b648036 [PATCH] cpuset: fix obscure attach_task vs exiting race
Fix obscure race condition in kernel/cpuset.c attach_task() code.

There is basically zero chance of anyone accidentally being harmed by this
race.

It requires a special 'micro-stress' load and a special timing loop hacks
in the kernel to hit in less than an hour, and even then you'd have to hit
it hundreds or thousands of times, followed by some unusual and senseless
cpuset configuration requests, including removing the top cpuset, to cause
any visibly harm affects.

One could, with perhaps a few days or weeks of such effort, get the
reference count on the top cpuset below zero, and manage to crash the
kernel by asking to remove the top cpuset.

I found it by code inspection.

The race was introduced when 'the_top_cpuset_hack' was introduced, and one
piece of code was not updated.  An old check for a possibly null task
cpuset pointer needed to be changed to a check for a task marked
PF_EXITING.  The pointer can't be null anymore, thanks to
the_top_cpuset_hack (documented in kernel/cpuset.c).  But the task could
have gone into PF_EXITING state after it was found in the task_list scan.

If a task is PF_EXITING in this code, it is possible that its task->cpuset
pointer is pointing to the top cpuset due to the_top_cpuset_hack, rather
than because the top_cpuset was that tasks last valid cpuset.  In that
case, the wrong cpuset reference counter would be decremented.

The fix is trivial.  Instead of failing the system call if the tasks cpuset
pointer is null here, fail it if the task is in PF_EXITING state.

The code for 'the_top_cpuset_hack' that changes an exiting tasks cpuset to
the top_cpuset is done without locking, so could happen at anytime.  But it
is done during the exit handling, after the PF_EXITING flag is set.  So if
we verify that a task is still not PF_EXITING after we copy out its cpuset
pointer (into 'oldcs', below), we know that 'oldcs' is not one of these
hack references to the top_cpuset.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:25 -07:00
Ingo Molnar
03cbc358aa [PATCH] lockdep core: improve the lock-chain-hash
With CONFIG_DEBUG_LOCK_ALLOC turned off i was getting sporadic failures in
the locking self-test:

  ------------>
  | Locking API testsuite:
  ----------------------------------------------------------------------------
                                   | spin |wlock |rlock |mutex | wsem | rsem |
    --------------------------------------------------------------------------
                       A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                   A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
               A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
               A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
           A-B-B-C-C-D-D-A deadlock:  ok  |FAILED|  ok  |  ok  |  ok  |  ok  |
           A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
           A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |FAILED|

after much debugging it turned out to be caused by accidental chain-hash
key collisions.  The current hash is:

 #define iterate_chain_key(key1, key2) \
	(((key1) << MAX_LOCKDEP_KEYS_BITS/2) ^ \
	((key1) >> (64-MAX_LOCKDEP_KEYS_BITS/2)) ^ \
 	(key2))

where MAX_LOCKDEP_KEYS_BITS is 11.  This hash is pretty good as it will
shift by 5 bits in every iteration, where every new ID 'mixed' into the
hash would have up to 11 bits.  But because there was a 6 bits overlap
between subsequent IDs and their high bits tended to be similar, there was
a chance for accidental chain-hash collision for a low number of locks
held.

the solution is to shift by 11 bits:

 #define iterate_chain_key(key1, key2) \
	(((key1) << MAX_LOCKDEP_KEYS_BITS) ^ \
	((key1) >> (64-MAX_LOCKDEP_KEYS_BITS)) ^ \
 	(key2))

This keeps the hash perfect up to 5 locks held, but even above that the
hash is still good because 11 bits is a relative prime to the total 64
bits, so a complete match will only occur after 64 held locks (which doesnt
happen in Linux).  Even after 5 locks held, entropy of the 5 IDs mixed into
the hash is already good enough so that overlap doesnt generate a colliding
hash ID.

with this change the false positives went away.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:25 -07:00
Alan Cox
eb84a20e9e [PATCH] audit/accounting: tty locking
Add tty locking around the audit and accounting code.

The whole current->signal-> locking is all deeply strange but it's for
someone else to sort out.  Add rather than replace the lock for acct.c

Signed-off-by: Alan Cox <alan@redhat.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:25 -07:00
Rusty Russell
e5582ca21a [PATCH] stop_machine.c copyright
I had to look back: this code was extracted from the module.c code in 2005.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:24 -07:00
Ian S. Nelson
04b1db9fd7 [PATCH] /sys/modules: allow full length section names
I've been using systemtap for some debugging and I noticed that it can't
probe a lot of modules.  Turns out it's kind of silly, the sections section
of /sys/module is limited to 32byte filenames and many of the actual
sections are a a bit longer than that.

[akpm@osdl.org: rewrite to use dymanic allocation]
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:23 -07:00
Paul Jackson
b1aac8bb82 [PATCH] cpuset: hotunplug cpus and mems in all cpusets
The cpuset code handling hot unplug of CPUs or Memory Nodes was incorrect -
it could remove a CPU or Node from the top cpuset, while leaving it still
in some child cpusets.

One basic rule of cpusets is that each cpusets cpus and mems are subsets of
its parents.  The cpuset hot unplug code violated this rule.

So the cpuset hotunplug handler must walk down the tree, removing any
removed CPU or Node from all cpusets.

However, it is not allowed to make a cpusets cpus or mems become empty.
They can only transition from empty to non-empty, not back.

So if the last CPU or Node would be removed from a cpuset by the above
walk, we scan back up the cpuset hierarchy, finding the nearest ancestor
that still has something online, and copy its CPU or Memory placement.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Nathan Lynch <ntl@pobox.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Paul Jackson
38837fc75a [PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_map
Change the list of memory nodes allowed to tasks in the top (root) nodeset
to dynamically track what cpus are online, using a call to a cpuset hook
from the memory hotplug code.  Make this top cpus file read-only.

On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.

If that system does support memory hotplug, then these tasks cannot make
use of memory nodes that are added after system boot, because the memory
nodes are not allowed in the top cpuset.  This is a surprising regression
over earlier kernels that didn't have cpusets enabled.

One key motivation for this change is to remain consistent with the
behaviour for the top_cpuset's 'cpus', which is also read-only, and which
automatically tracks the cpu_online_map.

This change also has the minor benefit that it fixes a long standing,
little noticed, minor bug in cpusets.  The cpuset performance tweak to
short circuit the cpuset_zone_allowed() check on systems with just a single
cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
changing the 'mems' of the top_cpuset had no affect, even though the change
(the write system call) appeared to succeed.  With the following change,
that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
refuses to be changed via user space writes.  Thus no one should be mislead
into thinking they've changed the top_cpusets's 'mems' when in affect they
haven't.

In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'mems' file in the top (root) cpuset, making it read
only, and making it automatically track the value of node_online_map.  Thus
tasks in the top cpuset will have automatic use of hot plugged memory nodes
allowed by their cpuset.

[akpm@osdl.org: build fix]
[bunk@stusta.de: build fix]
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Oleg Nesterov
c394cc9fbb [PATCH] introduce TASK_DEAD state
I am not sure about this patch, I am asking Ingo to take a decision.

task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion
it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should
live only in ->exit_state as documented in sched.h.

Note that this state is not visible to user-space, get_task_state() masks off
unsuitable states.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Oleg Nesterov
55a101f8f7 [PATCH] kill PF_DEAD flag
After the previous change (->flags & PF_DEAD) <=> (->state == EXIT_DEAD), we
don't need PF_DEAD any longer.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:20 -07:00
Oleg Nesterov
29b8849216 [PATCH] set EXIT_DEAD state in do_exit(), not in schedule()
schedule() checks PF_DEAD on every context switch and sets ->state = EXIT_DEAD
to ensure that the exiting task will be deactivated.  Note that this EXIT_DEAD
is in fact a "random" value, we can use any bit except normal TASK_XXX values.

It is better to set this state in do_exit() along with PF_DEAD flag and remove
that check in schedule().

We are safe wrt concurrent try_to_wake_up() (for example ptrace, tkill), it
can not change task's ->state: the 'state' argument of try_to_wake_up() can't
have EXIT_DEAD bit.  And in case when try_to_wake_up() sees a stale value of
->state == TASK_RUNNING it will do nothing.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:20 -07:00
Oleg Nesterov
aaa2a97eb9 [PATCH] sys_get_robust_list(): don't take tasklist_lock
use rcu locks for find_task_by_pid().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:18 -07:00
Oleg Nesterov
d359b549bf [PATCH] futex_find_get_task(): don't take tasklist_lock
It is ok to do find_task_by_pid() + get_task_struct() under
rcu_read_lock(), we cand drop tasklist_lock.

Note that testing of ->exit_state is racy with or without tasklist anyway.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:18 -07:00
Oleg Nesterov
5b160f5ecd [PATCH] copy_process: cosmetic ->ioprio tweak
copy_process:
// holds tasklist_lock + ->siglock
       /*
        * inherit ioprio
        */
       p->ioprio = current->ioprio;

Why?  ->ioprio was already copied in dup_task_struct().  I guess this is
needed to ensure that the child can't escape
sys_ioprio_set(IOPRIO_WHO_{PGRP,USER}), yes?

In that case we don't need ->siglock held, and the comment should be
updated.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:18 -07:00
Oleg Nesterov
1c573afebc [PATCH] reparent_to_init(): use has_rt_policy()
Remove open-coded has_rt_policy(), no changes in kernel/exit.o

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:18 -07:00
Oleg Nesterov
8dc3e9099e [PATCH] sched_setscheduler: fix? policy checks
I am not sure this patch is correct: I can't understand what the current
code does, and I don't know what it was supposed to do.

The comment says:

		 * can't change policy, except between SCHED_NORMAL
		 * and SCHED_BATCH:

The code:

		if (((policy != SCHED_NORMAL && p->policy != SCHED_BATCH) &&
			(policy != SCHED_BATCH && p->policy != SCHED_NORMAL)) &&

But this is equivalent to:

		if ( (is_rt_policy(policy) && has_rt_policy(p)) &&

which means something different.  We can't _decrease_ the current
->rt_priority with such a check (if rlim[RLIMIT_RTPRIO] == 0).

Probably, it was supposed to be:

		if (	!(policy == SCHED_NORMAL && p->policy == SCHED_BATCH)  &&
			!(policy == SCHED_BATCH  && p->policy == SCHED_NORMAL)

this matches the comment, but strange: it doesn't allow to _drop_ the
realtime priority when rlim[RLIMIT_RTPRIO] == 0.

I think the right check would be:

		/* can't set/change rt policy */
		if (is_rt_policy(policy) &&
				policy != p->policy &&
				!rlim_rtprio)
			return -EPERM;

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:17 -07:00
Oleg Nesterov
57a6f51c42 [PATCH] introduce is_rt_policy() helper
Imho, makes the code a bit easier to read.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:17 -07:00
Oleg Nesterov
5fe1d75f34 [PATCH] do_sched_setscheduler(): don't take tasklist_lock
Use rcu locks instead. sched_setscheduler() now takes ->siglock
before reading ->signal->rlim[].

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:17 -07:00
Cal Peake
c9472e0f28 [PATCH] kill extraneous printk in kernel_restart()
Get rid of an extraneous printk in kernel_restart().

Signed-off-by: Cal Peake <cp@absolutedigital.net>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:16 -07:00
Björn Steinbrink
111dbe0c8a [PATCH] Fix ____call_usermodehelper errors being silently ignored
If ____call_usermodehelper fails, we're not interested in the child
process' exit value, but the real error, so let's stop wait_for_helper from
overwriting it in that case.

Issue discovered by Benedikt Böhm while working on a Linux-VServer usermode
helper.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:16 -07:00
Alan Cox
54306cf04c [PATCH] exit: fix crash case
If we are going to BUG() not panic() here then we should cover the case of
the BUG being compiled out

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:16 -07:00
Roland McGrath
0b4a8a789a [PATCH] kexec warning fix
This fixes a couple of compiler warnings, and adds paranoia checks as well.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Atsushi Nemoto
3171a0305d [PATCH] simplify update_times (avoid jiffies/jiffies_64 aliasing problem)
Pass ticks to do_timer() and update_times(), and adjust x86_64 and s390
timer interrupt handler with this change.

Currently update_times() calculates ticks by "jiffies - wall_jiffies", but
callers of do_timer() should know how many ticks to update.  Passing ticks
get rid of this redundant calculation.  Also there are another redundancy
pointed out by Martin Schwidefsky.

This cleanup make a barrier added by
5aee405c66 needless.  So this patch removes
it.

As a bonus, this cleanup make wall_jiffies can be removed easily, since now
wall_jiffies is always synced with jiffies.  (This patch does not really
remove wall_jiffies.  It would be another cleanup patch)

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Roland McGrath
27d91e07f9 [PATCH] __dequeue_signal() cleanup
This tightens up __dequeue_signal a little.  It also avoids doing
recalc_sigpending twice in a row, instead doing it once in dequeue_signal.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Roland McGrath
b9ecb2bd5d [PATCH] has_stopped_jobs() cleanup
This check has been obsolete since the introduction of TASK_TRACED.  Now
TASK_STOPPED always means job control stop.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Toyo Abe
e4b765551a [PATCH] posix-timers: Fix the flags handling in posix_cpu_nsleep()
When a posix_cpu_nsleep() sleep is interrupted by a signal more than twice, it
incorrectly reports the sleep time remaining to the user.  Because
posix_cpu_nsleep() doesn't report back to the user when it's called from
restart function due to the wrong flags handling.

This patch, which applies after previous one, moves the nanosleep() function
from posix_cpu_nsleep() to do_cpu_nanosleep() and cleans up the flags handling
appropriately.

Signed-off-by: Toyo Abe <toyoa@mvista.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Toyo Abe
1711ef3866 [PATCH] posix-timers: Fix clock_nanosleep() doesn't return the remaining time in compatibility mode
The clock_nanosleep() function does not return the time remaining when the
sleep is interrupted by a signal.

This patch creates a new call out, compat_clock_nanosleep_restart(), which
handles returning the remaining time after a sleep is interrupted.  This
patch revives clock_nanosleep_restart().  It is now accessed via the new
call out.  The compat_clock_nanosleep_restart() is used for compatibility
access.

Since this is implemented in compatibility mode the normal path is
virtually unaffected - no real performance impact.

Signed-off-by: Toyo Abe <toyoa@mvista.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Akinobu Mita
07dccf3344 [PATCH] check return value of cpu_callback
Spawing ksoftirqd, migration, or watchdog, and calling init_timers_cpu()
may fail with small memory.  If it happens in initcalls, kernel NULL
pointer dereference happens later.  This patch makes crash happen
immediately in such cases.  It seems a bit better than getting kernel NULL
pointer dereference later.

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:14 -07:00
Paul E. McKenney
a45bce4954 [PATCH] memory ordering in __kfifo primitives
Both __kfifo_put() and __kfifo_get() have header comments stating that if
there is but one concurrent reader and one concurrent writer, locking is not
necessary.  This is almost the case, but a couple of memory barriers are
needed.  Another option would be to change the header comments to remove the
bit about locking not being needed, and to change the those callers who
currently don't use locking to add the required locking.  The attachment
analyzes this approach, but the patch below seems simpler.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Stelian Pop <stelian@popies.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:13 -07:00
Dave Jones
99de055ac0 [PATCH] lockdep: print kernel version
Lets do the same thing we do for oopses - print out the version in the
report.  It's an extra line of output though.  We could tack it on the end
of the INFO: lines, but that screws up Ingo's pretty output.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:13 -07:00
Sukadev Bhattiprolu
f400e198b2 [PATCH] pidspace: is_init()
This is an updated version of Eric Biederman's is_init() patch.
(http://lkml.org/lkml/2006/2/6/280).  It applies cleanly to 2.6.18-rc3 and
replaces a few more instances of ->pid == 1 with is_init().

Further, is_init() checks pid and thus removes dependency on Eric's other
patches for now.

Eric's original description:

	There are a lot of places in the kernel where we test for init
	because we give it special properties.  Most  significantly init
	must not die.  This results in code all over the kernel test
	->pid == 1.

	Introduce is_init to capture this case.

	With multiple pid spaces for all of the cases affected we are
	looking for only the first process on the system, not some other
	process that has pid == 1.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: <lxc-devel@lists.sourceforge.net>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:12 -07:00
Kirill Korotaev
3b9b8ab65d [PATCH] Fix unserialized task->files changing
Fixed race on put_files_struct on exec with proc.  Restoring files on
current on error path may lead to proc having a pointer to already kfree-d
files_struct.

->files changing at exit.c and khtread.c are safe as exit_files() makes all
things under lock.

Found during OpenVZ stress testing.

[akpm@osdl.org: add export]
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:12 -07:00
Chuck Ebbert
e6cab99bb4 [PATCH] unwind: fix unused variable warning when !CONFIG_MODULES
Fix "variable defined but not used" compiler warning in unwind.c when
CONFIG_MODULES is not set.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:11 -07:00
Rolf Eike Beer
2aae4a108d [PATCH] Fix kerneldoc comments in kernel/timer.c
Some of the kerneldoc comments in this file are ignored since the lead-in
is malformed, using either "/*" or "/***" instead of "/**".

[rdunlap@xenotime.net: kerneldoc fixes]
Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:10 -07:00
Steven Rostedt
db630637b2 [PATCH] clean up and remove some extra spinlocks from rtmutex
Oleg brought up some interesting points about grabbing the pi_lock for some
protections.  In this discussion, I realized that there are some places
that the pi_lock is being grabbed when it really wasn't necessary.  Also
this patch does a little bit of clean up.

This patch basically does three things:

1) renames the "boost" variable to "chain_walk".  Since it is used in
   the debugging case when it isn't going to be boosted.  It better
   describes what the test is going to do if it succeeds.

2) moves get_task_struct to just before the unlocking of the wait_lock.
   This removes duplicate code, and makes it a little easier to read.  The
   owner wont go away while either the pi_lock or the wait_lock are held.

3) removes the pi_locking and owner blocked checking completely from the
   debugging case.  This is because the grabbing the lock and doing the
   check, then releasing the lock is just so full of races.  It's just as
   good to go ahead and call the pi_chain_walk function, since after
   releasing the lock the owner can then block anyway, and we would have
   missed that.  For the debug case, we really do want to do the chain walk
   to test for deadlocks anyway.

[oleg@tv-sign.ru: more of the same]
Signed-of-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Esben Nielsen <nielsen.esben@googlemail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:09 -07:00
Alexey Dobriyan
6c5c934153 [PATCH] ifdef blktrace debugging fields
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:09 -07:00
Josh Triplett
89e7e374dd [PATCH] timer: add lock annotation to lock_timer_base
lock_timer_base acquires a lock and returns with that lock held.  Add a
lock annotation to this function so that sparse can check callers for lock
pairing, and so that sparse will not complain about this function since it
intentionally uses the lock in this manner.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:09 -07:00
Mark Huang
d10be6d1bd [PATCH] module_subsys: initialize earlier
Initialize module_subsys earlier (or at least earlier than devices) since
it could be used very early in the boot process if kmod loads a module
before the device initcalls.  Otherwise, kmod will crash in
kernel/module.c:mod_sysfs_setup() since the kset in module_subsys is not
initialized yet.

I only noticed this problem because occasionally, kmod loads the modules
for my SCSI and Ethernet adapters very early, during the boot process
itself.  I don't quite understand why it loads them sometimes and doesn't
load them other times.  Or who is telling kmod to do so.  Can someone
explain?

Signed-off-by: Mark Huang <mlhuang@cs.princeton.edu>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:08 -07:00
Josh Triplett
a49a4af759 [PATCH] rcu: add lock annotations to rcu{,_bh}_torture_read_{lock,unlock}
rcu_torture_read_lock and rcu_bh_torture_read_lock acquire locks without
releasing them, and the matching functions rcu_torture_read_unlock and
rcu_bh_torture_read_unlock get called with the corresponding locks held and
release them.  Add lock annotations to these four functions so that sparse
can check callers for lock pairing, and so that sparse will not complain
about these functions since they intentionally use locks in this manner.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:08 -07:00
Yoichi Yuasa
538d9d532b [PATCH] irq: remove a extra line
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:07 -07:00
Yoichi Yuasa
2ff6fd8f4a [PATCH] irq: fixed coding style
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:07 -07:00
Randy Dunlap
4c78a66393 [PATCH] kernel-doc for relay interface
Add relay interface support to DocBook/kernel-api.tmpl.  Fix typos etc.  in
relay.c and relayfs.txt.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:06 -07:00
Randy Dunlap
d8c7649e99 [PATCH] kernel/params: driver layer error checking
Check driver layer return values in kernel/params.c

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:04 -07:00
Matthew Wilcox
910067d188 [PATCH] remove generic__raw_read_trylock()
If the cpu has the lock held for write, is interrupted, and the interrupt
handler calls read_trylock(), it's an instant deadlock.

Now, Dave Miller has subsequently pointed out that we don't have any
situations where this can occur.  Nevertheless, we should delete
generic__raw_read_lock (and its associated EXPORT to make Arjan happy) so that
nobody thinks they can use it.

Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:03 -07:00
Linus Torvalds
ebdea46fec Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (130 commits)
  [ARM] 3856/1: Add clocksource for Intel IXP4xx platforms
  [ARM] 3855/1: Add generic time support
  [ARM] 3873/1: S3C24XX: Add irq_chip names
  [ARM] 3872/1: S3C24XX: Apply consistant tabbing to irq_chips
  [ARM] 3871/1: S3C24XX: Fix ordering of EINT4..23
  [ARM] nommu: confirms the CR_V bit in nommu mode
  [ARM] nommu: abort handler fixup for !CPU_CP15_MMU cores.
  [ARM] 3870/1: AT91: Start removing static memory mappings
  [ARM] 3869/1: AT91: NAND support for DK and KB9202 boards
  [ARM] 3868/1: AT91 hardware header update
  [ARM] 3867/1: AT91 GPIO update
  [ARM] 3866/1: AT91 clock update
  [ARM] 3865/1: AT91RM9200 header updates
  [ARM] 3862/2: S3C2410 - add basic power management support for AML M5900 series
  [ARM] kthread: switch arch/arm/kernel/apm.c
  [ARM] Off-by-one in arch/arm/common/icst*
  [ARM] 3864/1: Refactore sharpsl_pm
  [ARM] 3863/1: Add Locomo SPI Device
  [ARM] 3847/2:  Convert LOMOMO to use struct device for GPIOs
  [ARM] Use CPU_CACHE_* where possible in asm/cacheflush.h
  ...
2006-09-28 14:40:39 -07:00
Russell King
2dc94310bd Merge master.kernel.org:/pub/scm/linux/kernel/git/tmlind/linux-omap-upstream into devel 2006-09-27 19:57:54 +01:00
Eric W. Biederman
65800ac77e [PATCH] pid: remove temporary debug code in attach_pid
With the patches flying between Oleg and myself somehow this temporary
debug code got left in pid.c.  It was never intended to make it to the
stable kernel.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:19 -07:00
Eric W. Biederman
c18258c6f0 [PATCH] pid: Implement transfer_pid and use it to simplify de_thread
In de_thread we move pids from one process to another, a rather ugly case.
The function transfer_pid makes it clear what we are doing, and makes the
action atomic.  This is useful we ever want to atomically traverse the
process group and session lists, in a rcu safe manner.

Even if the atomic properties this change should be a win as transfer_pid
should be less code to execute than executing both attach_pid and
detach_pid, and this should make de_thread slightly smaller as only a
single function call needs to be emitted.  The only downside is that the
code might be slower to execute as the odds are against transfer_pid being
in cache.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:19 -07:00
Eric W. Biederman
b89a81712f [PATCH] sysctl: Allow /proc/sys without sys_sysctl
Since sys_sysctl is deprecated start allow it to be compiled out.  This
should catch any remaining user space code that cares, and paves the way
for further sysctl cleanups.

[akpm@osdl.org: If sys_sysctl() is not compiled-in, emit a warning]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:19 -07:00
Theodore Ts'o
ba52de123d [PATCH] inode-diet: Eliminate i_blksize from the inode structure
This eliminates the i_blksize field from struct inode.  Filesystems that want
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.

Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.

[bunk@stusta.de: cleanup]
[akpm@osdl.org: generic_fillattr() fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:18 -07:00
Theodore Ts'o
8e18e2941c [PATCH] inode_diet: Replace inode.u.generic_ip with inode.i_private
The following patches reduce the size of the VFS inode structure by 28 bytes
on a UP x86.  (It would be more on an x86_64 system).  This is a 10% reduction
in the inode size on a UP kernel that is configured in a production mode
(i.e., with no spinlock or other debugging functions enabled; if you want to
save memory taken up by in-core inodes, the first thing you should do is
disable the debugging options; they are responsible for a huge amount of bloat
in the VFS inode structure).

This patch:

The filesystem or device-specific pointer in the inode is inside a union,
which is pretty pointless given that all 30+ users of this field have been
using the void pointer.  Get rid of the union and rename it to i_private, with
a comment to explain who is allowed to use the void pointer.  This is just a
cleanup, but it allows us to reuse the union 'u' for something something where
the union will actually be used.

[judith@osdl.org: powerpc build fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Judith Lebzelter <judith@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:17 -07:00
David Howells
f269fdd182 [PATCH] NOMMU: move the fallback arch_vma_name() to a sensible place
Move the fallback arch_vma_name() to a sensible place (kernel/signal.c).

Currently it's in fs/proc/task_mmu.c, a file that is dependent on both
CONFIG_PROC_FS and CONFIG_MMU being enabled, but it's used from
kernel/signal.c from where it is called unconditionally.

[akpm@osdl.org: build fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:15 -07:00
David Howells
0ec76a110f [PATCH] NOMMU: Check that access_process_vm() has a valid target
Check that access_process_vm() is accessing a valid mapping in the target
process.

This limits ptrace() accesses and accesses through /proc/<pid>/maps to only
those regions actually mapped by a program.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:14 -07:00
Matthew Wilcox
d33b6fba2c Resources: insert identical resources above existing resources
If you have two resources which aree exactly the same size,
insert_resource() currently inserts the new one below the existing one. 
This is wrong because there's no way to insert a resource of the same size
above an existing one.

I took this opportunity to rewrite the initial loop to be a for-loop
instead of a goto-loop and fix the documentation.

Signed-off-by: Matthew Wilcox <matthew@wil.cx>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-26 17:43:52 -07:00
Linus Torvalds
b278240839 Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (225 commits)
  [PATCH] Don't set calgary iommu as default y
  [PATCH] i386/x86-64: New Intel feature flags
  [PATCH] x86: Add a cumulative thermal throttle event counter.
  [PATCH] i386: Make the jiffies compares use the 64bit safe macros.
  [PATCH] x86: Refactor thermal throttle processing
  [PATCH] Add 64bit jiffies compares (for use with get_jiffies_64)
  [PATCH] Fix unwinder warning in traps.c
  [PATCH] x86: Allow disabling early pci scans with pci=noearly or disallowing conf1
  [PATCH] x86: Move direct PCI scanning functions out of line
  [PATCH] i386/x86-64: Make all early PCI scans dependent on CONFIG_PCI
  [PATCH] Don't leak NT bit into next task
  [PATCH] i386/x86-64: Work around gcc bug with noreturn functions in unwinder
  [PATCH] Fix some broken white space in ia32_signal.c
  [PATCH] Initialize argument registers for 32bit signal handlers.
  [PATCH] Remove all traces of signal number conversion
  [PATCH] Don't synchronize time reading on single core AMD systems
  [PATCH] Remove outdated comment in x86-64 mmconfig code
  [PATCH] Use string instructions for Core2 copy/clear
  [PATCH] x86: - restore i8259A eoi status on resume
  [PATCH] i386: Split multi-line printk in oops output.
  ...
2006-09-26 13:07:55 -07:00
Linus Torvalds
dd77a4ee0f Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: (47 commits)
  Driver core: Don't call put methods while holding a spinlock
  Driver core: Remove unneeded routines from driver core
  Driver core: Fix potential deadlock in driver core
  PCI: enable driver multi-threaded probe
  Driver Core: add ability for drivers to do a threaded probe
  sysfs: add proper sysfs_init() prototype
  drivers/base: check errors
  drivers/base: Platform notify needs to occur before drivers attach to the device
  v4l-dev2: handle __must_check
  add CONFIG_ENABLE_MUST_CHECK
  add __must_check to device management code
  Driver core: fixed add_bind_files() definition
  Driver core: fix comments in drivers/base/power/resume.c
  sysfs_remove_bin_file: no return value, dump_stack on error
  kobject: must_check fixes
  Driver core: add ability for devices to create and remove bin files
  Class: add support for class interfaces for devices
  Driver core: create devices/virtual/ tree
  Driver core: add device_rename function
  Driver core: add ability for classes to handle devices properly
  ...
2006-09-26 11:49:46 -07:00
Rafael J. Wysocki
c5c6ba4e08 [PATCH] PM: Add pm_trace switch
Add the pm_trace attribute in /sys/power which has to be explicitly set to
one to really enable the "PM tracing" code compiled in when CONFIG_PM_TRACE
is set (which modifies the machine's CMOS clock in unpredictable ways).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:04 -07:00
Rafael J. Wysocki
c8eb8b4025 [PATCH] PM: make it possible to disable console suspending
Change suspend_console() so that it waits for all consoles to flush the
remaining messages and make it possible to switch the console suspending off
with the help of a Kconfig option.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Stefan Seyfried <seife@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:03 -07:00
Rafael J. Wysocki
940864ddab [PATCH] swsusp: Use memory bitmaps during resume
Make swsusp use memory bitmaps to store its internal information during the
resume phase of the suspend-resume cycle.

If the pfns of saveable pages are saved during the suspend phase instead of
the kernel virtual addresses of these pages, we can use them during the resume
phase directly to set the corresponding bits in a memory bitmap.  Then, this
bitmap is used to mark the page frames corresponding to the pages that were
saveable before the suspend (aka "unsafe" page frames).

Next, we allocate as many page frames as needed to store the entire suspend
image and make sure that there will be some extra free "safe" page frames for
the list of PBEs constructed later.  Subsequently, the image is loaded and, if
possible, the data loaded from it are written into their "original" page
frames (ie.  the ones they had occupied before the suspend).

The image data that cannot be written into their "original" page frames are
loaded into "safe" page frames and their "original" kernel virtual addresses,
as well as the addresses of the "safe" pages containing their copies, are
stored in a list of PBEs.  Finally, the list of PBEs is used to copy the
remaining image data into their "original" page frames (this is done
atomically, by the architecture-dependent parts of swsusp).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:02 -07:00
Rafael J. Wysocki
b788db7989 [PATCH] swsusp: Introduce memory bitmaps
Introduce the memory bitmap data structure and make swsusp use in the suspend
phase.

The current swsusp's internal data structure is not very efficient from the
memory usage point of view, so it seems reasonable to replace it with a data
structure that will require less memory, such as a pair of bitmaps.

The idea is to use bitmaps that may be allocated as sets of individual pages,
so that we can avoid making allocations of order greater than 0.  For this
reason the memory bitmap structure consists of several linked lists of objects
that contain pointers to memory pages with the actual bitmap data.  Still, for
a typical system all of these lists fit in a single page, so it's reasonable
to introduce an additional mechanism allowing us to allocate all of them
efficiently without sacrificing the generality of the design.  This is done
with the help of the chain_allocator structure and associated functions.

We need to use two memory bitmaps during the suspend phase of the
suspend-resume cycle.  One of them is necessary for marking the saveable
pages, and the second is used to mark the pages in which to store the copies
of them (aka image pages).

First, the bitmaps are created and we allocate as many image pages as needed
(the corresponding bits in the second bitmap are set as soon as the pages are
allocated).  Second, the bits corresponding to the saveable pages are set in
the first bitmap and the saveable pages are copied to the image pages.
Finally, the first bitmap is used to save the kernel virtual addresses of the
saveable pages and the second one is used to save the contents of the image
pages.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:02 -07:00
Rafael J. Wysocki
0bcd888d64 [PATCH] swsusp: Introduce some helpful constants
Introduce some constants that hopefully will help improve the readability of
code in kernel/power/snapshot.c.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:02 -07:00
Rafael J. Wysocki
75534b50cc [PATCH] Change the name of pagedir_nosave
The name of the pagedir_nosave variable does not make sense any more, so it
seems reasonable to change it to something more meaningful.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:01 -07:00
Rafael J. Wysocki
cd560bb2f9 [PATCH] swsusp: Fix alloc_pagedir
Get rid of the FIXME in kernel/power/snapshot.c#alloc_pagedir() and
simplify the functions called by it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Rafael J. Wysocki
f6143aa60e [PATCH] swsusp: Reorder memory-allocating functions
Move some functions in kernel/power/snapshot.c to a better place (in the
same file) and introduce free_image_page() (will be necessary in the
future).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Rafael J. Wysocki
f623f0db8e [PATCH] swsusp: Fix mark_free_pages
Clean up mm/page_alloc.c#mark_free_pages() and make it avoid clearing
PageNosaveFree for PageNosave pages.  This allows us to get rid of an ugly
hack in kernel/power/snapshot.c#copy_data_pages().

Additionally, the page-copying loop in copy_data_pages() is moved to an
inline function.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Rafael J. Wysocki
e3920fb42c [PATCH] Disable CPU hotplug during suspend
The current suspend code has to be run on one CPU, so we use the CPU
hotplug to take the non-boot CPUs offline on SMP machines.  However, we
should also make sure that these CPUs will not be enabled by someone else
after we have disabled them.

The functions disable_nonboot_cpus() and enable_nonboot_cpus() are moved to
kernel/cpu.c, because they now refer to some stuff in there that should
better be static.  Also it's better if disable_nonboot_cpus() returns an
error instead of panicking if something goes wrong, and
enable_nonboot_cpus() has no reason to panic(), because the CPUs may have
been enabled by the userland before it tries to take them online.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Rafael J. Wysocki
fb13a28b0f [PATCH] swsusp: struct snapshot_handle cleanup
Add comments describing struct snapshot_handle and its members, change the
confusing name of its member 'page' to 'cur'.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Rafael J. Wysocki
ae83c5eef5 [PATCH] swsusp: clean up browsing of pfns
Clean up some loops over pfns for each zone in snapshot.c: reduce the
number of additions to perform, rework detection of saveable pages and make
the code a bit less difficult to understand, hopefully.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Andrew Morton
546e0d2719 [PATCH] swsusp: read speedup
Implement async reads for swsusp resuming.

Crufty old PIII testbox:
	15.7 MB/s -> 20.3 MB/s

Sony Vaio:
	14.6 MB/s -> 33.3 MB/s

I didn't implement the post-resume bio_set_pages_dirty().  I don't really
understand why resume needs to run set_page_dirty() against these pages.

It might be a worry that this code modifies PG_Uptodate, PG_Error and
PG_Locked against the image pages.  Can this possibly affect the resumed-into
kernel?  Hopefully not, if we're atomically restoring its mem_map?

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Jens Axboe <axboe@suse.de>
Cc: Laurent Riffard <laurent.riffard@free.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Andrew Morton
8c002494b5 [PATCH] swsusp: add read-speed instrumentation
Add some instrumentation to the swsusp readin code to show what bandwidth
we're achieving.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Andrew Morton
ab95416035 [PATCH] swsusp: write speedup
Switch the swsusp writeout code from 4k-at-a-time to 4MB-at-a-time.

Crufty old PIII testbox:
	12.9 MB/s -> 20.9 MB/s

Sony Vaio:
	14.7 MB/s -> 26.5 MB/s

The implementation is crude.  A better one would use larger BIOs, but wouldn't
gain any performance.

The memcpys will be mostly pipelined with the IO and basically come for free.

The ENOMEM path has not been tested.  It should be.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Andrew Morton
3a4f7577c9 [PATCH] swsusp: add write-speed instrumentation
Add some instrumentation to the swsusp writeout code to show what bandwidth
we're achieving.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
David Howells
af8c65b57a [PATCH] FRV: permit __do_IRQ() to be dispensed with
Permit __do_IRQ() to be dispensed with based on a configuration option.

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:53 -07:00
Stephen Smalley
1a70cd40cb [PATCH] selinux: rename selinux_ctxid_to_string
Rename selinux_ctxid_to_string to selinux_sid_to_string to be
consistent with other interfaces.

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:52 -07:00
Stephen Smalley
62bac0185a [PATCH] selinux: eliminate selinux_task_ctxid
Eliminate selinux_task_ctxid since it duplicates selinux_task_get_sid.

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:52 -07:00
Christoph Lameter
89fa30242f [PATCH] NUMA: Add zone_to_nid function
There are many places where we need to determine the node of a zone.
Currently we use a difficult to read sequence of pointer dereferencing.
Put that into an inline function and use throughout VM.  Maybe we can find
a way to optimize the lookup in the future.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:52 -07:00
Christoph Lameter
0ff38490c8 [PATCH] zone_reclaim: dynamic slab reclaim
Currently one can enable slab reclaim by setting an explicit option in
/proc/sys/vm/zone_reclaim_mode.  Slab reclaim is then used as a final
option if the freeing of unmapped file backed pages is not enough to free
enough pages to allow a local allocation.

However, that means that the slab can grow excessively and that most memory
of a node may be used by slabs.  We have had a case where a machine with
46GB of memory was using 40-42GB for slab.  Zone reclaim was effective in
dealing with pagecache pages.  However, slab reclaim was only done during
global reclaim (which is a bit rare on NUMA systems).

This patch implements slab reclaim during zone reclaim.  Zone reclaim
occurs if there is a danger of an off node allocation.  At that point we

1. Shrink the per node page cache if the number of pagecache
   pages is more than min_unmapped_ratio percent of pages in a zone.

2. Shrink the slab cache if the number of the nodes reclaimable slab pages
   (patch depends on earlier one that implements that counter)
   are more than min_slab_ratio (a new /proc/sys/vm tunable).

The shrinking of the slab cache is a bit problematic since it is not node
specific.  So we simply calculate what point in the slab we want to reach
(current per node slab use minus the number of pages that neeed to be
allocated) and then repeately run the global reclaim until that is
unsuccessful or we have reached the limit.  I hope we will have zone based
slab reclaim at some point which will make that easier.

The default for the min_slab_ratio is 5%

Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

[akpm@osdl.org: cleanups]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:51 -07:00
Christoph Lameter
fbd98167e6 [PATCH] Profiling: require buffer allocation on the correct node
Profiling really suffers with off node buffers.  Fail if no memory is
available on the nodes.  The profiling code can deal with these failures
should they occur.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:50 -07:00
Christoph Lameter
9b819d204c [PATCH] Add __GFP_THISNODE to avoid fallback to other nodes and ignore cpuset/memory policy restrictions
Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes.  This
flag is essential if a kernel component requires memory to be located on a
certain node.  It will be needed for alloc_pages_node() to force allocation
on the indicated node and for alloc_pages() to force allocation on the
current node.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:50 -07:00
Christoph Lameter
0a2966b48f [PATCH] Fix longstanding load balancing bug in the scheduler
The scheduler will stop load balancing if the most busy processor contains
processes pinned via processor affinity.

The scheduler currently only does one search for busiest cpu.  If it cannot
pull any tasks away from the busiest cpu because they were pinned then the
scheduler goes into a corner and sulks leaving the idle processors idle.

F.e.  If you have processor 0 busy running four tasks pinned via taskset,
there are none on processor 1 and one just started two processes on
processor 2 then the scheduler will not move one of the two processes away
from processor 2.

This patch fixes that issue by forcing the scheduler to come out of its
corner and retrying the load balancing by considering other processors for
load balancing.

This patch was originally developed by John Hawkes and discussed at

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113901368523205&w=2.

I have removed extraneous material and gone back to equipping struct rq
with the cpu the queue is associated with since this makes the patch much
easier and it is likely that others in the future will have the same
difficulty of figuring out which processor owns which runqueue.

The overhead added through these patches is a single word on the stack if
the kernel is configured to support 32 cpus or less (32 bit).  For 32 bit
environments the maximum number of cpus that can be configued is 255 which
would result in the use of 32 bytes additional on the stack.  On IA64 up to
1k cpus can be configured which will result in the use of 128 additional
bytes on the stack.  The maximum additional cache footprint is one
cacheline.  Typically memory use will be much less than a cacheline and the
additional cpumask will be placed on the stack in a cacheline that already
contains other local variable.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: John Hawkes <hawkes@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:43 -07:00
Jan Beulich
adf1423698 [PATCH] i386/x86-64: Work around gcc bug with noreturn functions in unwinder
Current gcc generates calls not jumps to noreturn functions. When that happens the
return address can point to the next function, which confuses the unwinder.

This patch works around it by marking asynchronous exception
frames in contrast normal call frames in the unwind information.  Then teach
the unwinder to decode this.

For normal call frames the unwinder now subtracts one from the address which avoids
this problem.  The standard libgcc unwinder uses the same trick.

It doesn't include adjustment of the printed address (i.e. for the original
example, it'd still be kernel_math_error+0 that gets displayed, but the
unwinder wouldn't get confused anymore.

This only works with binutils 2.6.17+ and some versions of H.J.Lu's 2.6.16
unfortunately because earlier binutils don't support .cfi_signal_frame

[AK: added automatic detection of the new binutils and wrote description]

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:41 +02:00
Arjan van de Ven
3162f751d0 [PATCH] Add the __stack_chk_fail() function
GCC emits a call to a __stack_chk_fail() function when the stack canary is
not matching the expected value.

Since this is a bad security issue; lets panic the kernel rather than limping
along; the kernel really can't be trusted anymore when this happens.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
CC: Andi Kleen <ak@suse.de>
2006-09-26 10:52:39 +02:00
Arjan van de Ven
0a42540580 [PATCH] Add the canary field to the PDA area and the task struct
This patch adds the per thread cookie field to the task struct and the PDA.
Also it makes sure that the PDA value gets the new cookie value at context
switch, and that a new task gets a new cookie at task creation time.

Signed-off-by: Arjan van Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
CC: Andi Kleen <ak@suse.de>
2006-09-26 10:52:38 +02:00
Andi Kleen
3fa7c794fe [PATCH] Avoid recursion in lockdep when stack tracer takes locks
The new dwarf2 unwinder needs to take locks to do backtraces
inside modules. This patch makes sure lockdep which calls
stacktrace is not reentered.

Thanks to Ingo for suggesting this simpler approach.

Cc: mingo@elte.hu
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:34 +02:00
Andi Kleen
5a1b3999d6 [PATCH] x86: Some preparationary cleanup for stack trace
- Remove unused all_contexts parameter
No caller used it
- Move skip argument into the structure (needed for
followon patches)

Cc: mingo@elte.hu

Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:34 +02:00
Andi Kleen
0cb91a2293 [PATCH] i386: Account spinlocks to the caller during profiling for !FP kernels
This ports the algorithm from x86-64 (with improvements) to i386.
Previously this only worked for frame pointer enabled kernels.
But spinlocks have a very simple stack frame that can be manually
analyzed. Do this.

Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:28 +02:00
Andi Kleen
3cfc348bf9 [PATCH] x86: Add portable getcpu call
For NUMA optimization and some other algorithms it is useful to have a fast
to get the current CPU and node numbers in user space.

x86-64 added a fast way to do this in a vsyscall. This adds a generic
syscall for other architectures to make it a generic portable facility.

I expect some of them will also implement it as a faster vsyscall.

The cache is an optimization for the x86-64 vsyscall optimization. Since
what the syscall returns is an approximation anyways and user space
often wants very fast results it can be cached for some time.  The norma
methods to get this information in user space are relatively slow

The vsyscall is in a better position to manage the cache because it has direct
access to a fast time stamp (jiffies). For the generic syscall optimization
it doesn't help much, but enforce a valid argument to keep programs
portable

I only added an i386 syscall entry for now. Other architectures can follow
as needed.

AK: Also added some cleanups from Andrew Morton

Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:28 +02:00
Don Zickus
8da5adda91 [PATCH] x86: Allow users to force a panic on NMI
To quote Alan Cox:

The default Linux behaviour on an NMI of either memory or unknown is to
continue operation. For many environments such as scientific computing
it is preferable that the box is taken out and the error dealt with than
an uncorrected parity/ECC error get propogated.

A small number of systems do generate NMI's for bizarre random reasons
such as power management so the default is unchanged. In other respects
the new proc/sys entry works like the existing panic controls already in
that directory.

This is separate to the edac support - EDAC allows supported chipsets to
handle ECC errors well, this change allows unsupported cases to at least
panic rather than cause problems further down the line.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:27 +02:00
Don Zickus
407984f1af [PATCH] x86: Add abilty to enable/disable nmi watchdog with sysctl
Adds a new /proc/sys/kernel/nmi call that will enable/disable the nmi
watchdog.

Signed-off-by:  Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:27 +02:00
Don Zickus
2fbe7b25c8 [PATCH] i386/x86-64: Remove un/set_nmi_callback and reserve/release_lapic_nmi functions
Removes the un/set_nmi_callback and reserve/release_lapic_nmi functions as
they are no longer needed.  The various subsystems are modified to register
with the die_notifier instead.

Also includes compile fixes by Andrew Morton.

Signed-off-by:  Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:27 +02:00
David Brownell
1d3a82af45 PM: no suspend_prepare() phase
Remove the new suspend_prepare() phase.  It doesn't seem very usable,
has never been tested, doesn't address fault cleanup, and would need
a sibling resume_complete(); plus there are no real use cases.  It
could be restored later if those issues get resolved.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-25 21:08:38 -07:00
David Brownell
2bca293e56 PM: add kconfig option for deprecated .../power/state files
Add a new PM_SYSFS_DEPRECATED config option to control whether or
not the /sys/devices/.../power/state files are provided.  This will
make it easier to get rid of that mechanism when the time comes,
and to verify that userspace tools work right without it.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-25 21:08:37 -07:00
David Brownell
f1cc0a894c PM: issue PM_EVENT_PRETHAW
This patch is the first of this series that should actually change any
behavior ...  by issuing the new event, now tha the rest of the kernel is
prepared to receive it.

This converts the PM core to issue the new PRETHAW message, which the rest of
the kernel is now ready to receive.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-25 21:08:37 -07:00
Linus Torvalds
7c8265f510 Suspend infrastructure cleanup and extension
Allow devices to participate in the suspend process more intimately,
in particular, allow the final phase (with interrupts disabled) to
also be open to normal devices, not just system devices.

Also, allow classes to participate in device suspend.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-25 21:08:36 -07:00
Ed Swierk
1cc5f7142e [PATCH] load_module: no BUG if module_subsys uninitialized
Invoking load_module() before param_sysfs_init() is called crashes in
mod_sysfs_setup(), since the kset in module_subsys is not initialized yet.

In my case, net-pf-1 is getting modprobed as a result of hotplug trying to
create a UNIX socket.  Calls to hotplug begin after the topology_init
initcall.

Another patch for the same symptom (module_subsys-initialize-earlier.patch)
moves param_sysfs_init() to the subsys initcalls, but this is still not
early enough in the boot process in some cases.  In particular,
topology_init() causes /sbin/hotplug to run, which requests net-pf-1 (the
UNIX socket protocol) which can be compiled as a module.  Moving
param_sysfs_init() to the postcore initcalls fixes this particular race,
but there might well be other cases where a usermodehelper causes a module
to load earlier still.

The patch makes load_module() return an error rather than crashing the
kernel if invoked before module_subsys is initialized.

Cc: Mark Huang <mlhuang@cs.princeton.edu>
Cc: Greg KH <greg@kroah.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-25 17:38:36 -07:00
Thomas Graf
fe4944e59c [NETLINK]: Extend netlink messaging interface
Adds:
 nlmsg_get_pos()                 return current position in message
 nlmsg_trim()                    trim part of message
 nla_reserve_nohdr(skb, len)     reserve room for an attribute w/o hdr
 nla_put_nohdr(skb, len, data)   add attribute w/o hdr
 nla_find_nested()               find attribute in nested attributes

Fixes nlmsg_new() to take allocation flags and consider size.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 14:53:43 -07:00
Russell King
b36e4758dc [ARM] Fix kernel/fork.c for lockdep on ARM
ARM has interrupts enabled over context switches (iow, has
__ARCH_WANT_INTERRUPTS_ON_CTXSW defined.)  The lockdep code in fork.c
 assumes that interrupts are always disabled.  Fix this wrong
assumption by making the initialisation of 'p->hardirqs_enabled'
depend on __ARCH_WANT_INTERRUPTS_ON_CTXSW.

Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-09-20 14:58:35 +01:00
Ingo Molnar
86998aa653 [PATCH] genirq core: fix handle_level_irq()
while porting the -rt tree to 2.6.18-rc7 i noticed the following
screaming-IRQ scenario on an SMP system:

 2274  0Dn.:1 0.001ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.010ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.020ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.029ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.039ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.048ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.058ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.068ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.077ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.087ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.097ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)

as it turns out, the bug is caused by handle_level_irq(), which if it
races with another CPU already handling this IRQ, it _unmasks_ the IRQ
line on the way out. This is not how 2.6.17 works, and we introduced
this bug in one of the early genirq cleanups right before it went into
-mm. (the bug was not in the genirq patchset for a long time, and we
didnt notice the bug due to the lack of -rt rebase to the new genirq
code. -rt, and hardirq-preemption in particular opens up such races much
wider than anything else.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-19 07:57:20 -07:00
Kenneth Lee
e4b69aa2a1 [PATCH] bug fix in kernel/kmod.c
I think there is a bug in kmod.c: In __call_usermodehelper(), when
kernel_thread(wait_for_helper, ...) return success, since wait_for_helper()
might call complete() at any time, the sub_info should not be used any
more.

Normally wait_for_helper() take a long time to finish, you may not get
problem for most of the case.  But if you remove /sbin/modprobe, it may
become easier for you to get a oop in khelper.

Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-16 12:54:32 -07:00
Imre Deak
e1ed7ac77b [PATCH] genirq: fix typo in IRQ resend
Fix a bug where the IRQ_PENDING flag is never cleared and the ISR is called
endlessly without an actual interrupt.

Signed-off-by: Imre Deak <imre.deak@solidboot.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-16 12:54:30 -07:00
Oleg Nesterov
dd9daa221e [PATCH] rcu_do_batch: make ->qlen decrement irq safe
rcu_do_batch() decrements rdp->qlen with irqs enabled.  This is not good,
it can also be modified by call_rcu() from interrupt.

Decrement ->qlen once with irqs disabled, after a main loop.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-13 07:32:14 -07:00
Ingo Molnar
9bb25bf36f [PATCH] lockdep: double the number of stack-trace entries
Miles Lane reported the "BUG: MAX_STACK_TRACE_ENTRIES too low!" message,
which means that during normal use his system produced enough lockdep
events so that the 128-thousand entries stack-trace array got exhausted.
Double the size of the array.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Miles Lane <miles.lane@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-13 07:32:14 -07:00
Al Viro
55669bfa14 [PATCH] audit: AUDIT_PERM support
add support for AUDIT_PERM predicate

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-09-11 13:32:30 -04:00
Amy Griffis
5974501e2d [PATCH] update audit rule change messages
Make the audit message for implicit rule removal more informative.
Make the rule update message consistent with other messages.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-09-11 13:32:17 -04:00
Amy Griffis
8ef2d3040e [PATCH] sanity check audit_buffer
Add sanity checks for NULL audit_buffer consistent with other
audit_log* routines.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-09-11 13:32:17 -04:00
Steve Grubb
3b33ac3182 [PATCH] fix ppid bug in 2.6.18 kernel
Hello,

During some troubleshooting, I found that ppid was accidentally omitted from
the legacy rule section. This resulted in EINVAL for any rule with ppid sent
with AUDIT_ADD.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-09-11 13:32:04 -04:00
Thomas Gleixner
c5780e976e [PATCH] Use the correct restart option for futex_lock_pi
The current implementation of futex_lock_pi returns -ERESTART_RESTARTBLOCK
in case that the lock operation has been interrupted by a signal.  This
results in a return of -EINTR to userspace in case there is an handler for
the signal.  This is wrong, because userspace expects that the lock
function does not return in any case of signal delivery.

This was not caught by my insufficient test case, but triggered a nasty
userspace problem in an high load application scenario.  Unfortunately also
glibc does not check for this invalid return value.

Using -ERSTARTNOINTR makes sure, that the interrupted syscall is restarted.
 The restart block related code can be safely removed, as the possible
timeout argument is an absolute time value.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-08 10:22:50 -07:00
Ingo Molnar
068c4579fe [PATCH] lockdep: do not touch console state when tainting the kernel
Remove an unintended console_verbose() side-effect from add_taint().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-06 11:00:02 -07:00
Pavel Machek
471b40d0df [PATCH] prevent swsusp with PAE
PAE + swsusp results in hard-to-debug crash about 50% of time during
resume.  Cause is known, fix needs to be ported from x86-64 (but we can't
make it to 2.6.18, and I'd like this to be worked around in 2.6.18).

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-06 11:00:02 -07:00
Jarek Poplawski
fc47e7b592 [PATCH] lockdep ifdef fix
With

	CONFIG_SMP=y
	CONFIG_PREEMPT=y
	CONFIG_LOCKDEP=y
	CONFIG_DEBUG_LOCK_ALLOC=y
	# CONFIG_PROVE_LOCKING is not set

spin_unlock_irqrestore() goes through lockdep but spin_lock_irqsave() doesn't.
Apparently, bad things happen.

Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-06 11:00:01 -07:00
Oleg Nesterov
3b6362b833 [PATCH] eligible_child: remove an obsolete ->tgid check
It is not possible to find a sub-thread in ->children/->ptrace_children
lists, ptrace_attach() does not allow to attach to sub-threads.

Even if it was possible to ptrace the task from the same thread group,
we can't allow to release ->group_leader while there are others (ptracer)
threads in the same group.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-02 14:51:27 -07:00
Henrik Kretzschmar
43a1dd502f [PATCH] kerneldoc for handle_bad_irq()
Adds the description of the parameters from handle_bad_irq().

Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-01 11:39:09 -07:00
Shailabh Nagar
35df17c57c [PATCH] task delay accounting fixes
Cleanup allocation and freeing of tsk->delays used by delay accounting.
This solves two problems reported for delay accounting:

1. oops in __delayacct_blkio_ticks
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0608.2/1844.html

Currently tsk->delays is getting freed too early in task exit which can
cause a NULL tsk->delays to get accessed via reading of /proc/<tgid>/stats.
 The patch fixes this problem by freeing tsk->delays closer to when
task_struct itself is freed up.  As a result, it also eliminates the use of
tsk->delays_lock which was only being used (inadequately) to safeguard
access to tsk->delays while a task was exiting.

2. Possible memory leak in kernel/delayacct.c
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0608.2/1389.html

The patch cleans up tsk->delays allocations after a bad fork which was
missing earlier.

The patch has been tested to fix the problems listed above and stress
tested with rapid calls to delay accounting's taskstats command interface
(which is the other path that can access the same data, besides the /proc
interface causing the oops above).

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-01 11:39:08 -07:00
Nick Piggin
0d673a5a47 [PATCH] cpuset: oom panic fix
cpuset_excl_nodes_overlap always returns 0 if current is exiting.  This caused
customer's systems to panic in the OOM killer when processes were having
trouble getting memory for the final put_user in mm_release.  Even though
there were lots of processes to kill.

Change to returning 1 in this case.  This achieves parity with !CONFIG_CPUSETS
case, and was observed to fix the problem.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:32 -07:00
Paul Jackson
4c4d50f7b3 [PATCH] cpuset: top_cpuset tracks hotplug changes to cpu_online_map
Change the list of cpus allowed to tasks in the top (root) cpuset to
dynamically track what cpus are online, using a CPU hotplug notifier.  Make
this top cpus file read-only.

On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.

If that system does support CPU hotplug, then these tasks cannot make use
of CPUs that are added after system boot, because the CPUs are not allowed
in the top cpuset.  This is a surprising regression over earlier kernels
that didn't have cpusets enabled.

In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'cpus' file in the top (root) cpuset, making it read
only, and making it automatically track the value of cpu_online_map.  Thus
tasks in the top cpuset will have automatic use of hot plugged CPUs allowed
by their cpuset.

Thanks to Anton Blanchard and Nathan Lynch for reporting this problem,
driving the fix, and earlier versions of this patch.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Nathan Lynch <ntl@pobox.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:32 -07:00
Yingchao Zhou
4edb9a143e [PATCH] Remove redundant up() in stop_machine()
An up() is called in kernel/stop_machine.c on failure, and also in the
caller (unconditionally).

Signed-off-by: Zhou Yingchao <yingchao.zhou@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:31 -07:00
Oleg Nesterov
d015baebba [PATCH] futex_find_get_task(): remove an obscure EXIT_ZOMBIE check
futex_find_get_task:

	if (p->state == EXIT_ZOMBIE || p->exit_state == EXIT_ZOMBIE)
		return NULL;

I can't understand this.  First, p->state can't be EXIT_ZOMBIE.  The
->exit_state check looks strange too.  Sub-threads or tasks whose ->parent
ignores SIGCHLD go directly to EXIT_DEAD state (I am ignoring a ptrace
case).  Why EXIT_DEAD tasks should be ok?  Yes, EXIT_ZOMBIE is more
important (a task may stay zombie for a long time), but this doesn't mean
we should explicitely ignore other EXIT_XXX states.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:30 -07:00
Oleg Nesterov
f8986c241d [PATCH] revert "Drop tasklist lock in do_sched_setscheduler"
sched_setscheduler() looks at ->signal->rlim[].  It is unsafe do
dereference ->signal unless tasklist_lock or ->siglock is held (or p ==
current).  We pin the task structure, but this can't prevent from
release_task()->__exit_signal() which sets ->signal = NULL.

Restore tasklist_lock across the setscheduler call.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:29 -07:00
Andrew Morton
9b41ea7289 [PATCH] workqueue: remove lock_cpu_hotplug()
Use a private lock instead.  It protects all per-cpu data structures in
workqueue.c, including the workqueues list.

Fix a bug in schedule_on_each_cpu(): it was forgetting to lock down the
per-cpu resources.

Unfixed long-standing bug: if someone unplugs the CPU identified by
`singlethread_cpu' the kernel will get very sick.

Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14 12:54:29 -07:00
john stultz
e579dcbf23 [PATCH] futex_handle_fault always fails
We found this issue last week w/ the -RT kernel, but it seems the same
issue is in mainline as well.

Basically it is possible for futex_unlock_pi to return without actually
freeing the lock.  This is due to buggy logic in the use of
futex_handle_fault() and its attempt argument in a failure case.

Looking at futex.c the logic is as follows:

1) In futex_unlock_pi() we start w/ ret=0 and we go down to the first
   futex_atomic_cmpxchg_inatomic(), where we find uval==-EFAULT.  We then
   jump to the pi_faulted label.

2) From pi_faulted: We increment attempt, unlock the sem and hit the
   retry label.

3) From the retry label, with ret still zero, we again hit EFAULT on the
   first futex_atomic_cmpxchg_inatomic(), and again goto the pi_faulted
   label.

4) Again from pi_faulted: we increment attempt and enter the
   conditional, where we call futex_handle_fault.

5) futex_handle_fault fails, and we goto the out_unlock_release_sem
   label.

6) From out_unlock_release_sem we return, and since ret is still zero,
   we return without error, while never actually unlocking the lock.

Issue #1: at the first futex_atomic_cmpxchg_inatomic() we should probably
be setting ret=-EFAULT before jumping to pi_faulted: However in our case
this doesn't really affect anything, as the glibc we're using ignores the
error value from futex_unlock_pi().

Issue #2: Look at futex_handle_fault(), its first conditional will return
-EFAULT if attempt is >= 2.  However, from the "if(attempt++)
futex_handle_fault(attempt)" logic above, we'll *never* call
futex_handle_fault when attempt is less then two.  So we never get a chance
to even try to fault the page in.

The following patch addresses these two issues by 1) Always setting ret to
-EFAULT if futex_handle_fault fails, and 2) Removing the = in
futex_handle_fault's (attempt >= 2) check.

I'm really not sure this is the right fix, but wanted to bring it up so
folks knew the issue is alive and well in the current -git tree.  From
looking at the git logs the logic was first introduced (then later copied
to other places) in the following commit almost a year ago:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4732efbeb997189d9f9b04708dc26bf8613ed721;hp=5b039e681b8c5f30aac9cc04385cc94be45d0823

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14 12:54:29 -07:00
Kirill Korotaev
6997a6faaa [PATCH] sys_getppid oopses on debug kernel
sys_getppid() optimization can access a freed memory.  On kernels with
DEBUG_SLAB turned ON, this results in Oops.  As Dave Hansen noted, this
optimization is also unsafe for memory hotplug.

So this patch always takes the lock to be safe.

[oleg@tv-sign.ru: simplifications]
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: <stable@kernel.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14 12:54:29 -07:00
Andrew Morton
657b3010d8 [PATCH] panic.c build fix
kernel/panic.c: In function 'add_taint':
kernel/panic.c:176: warning: implicit declaration of function 'debug_locks_off'

Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14 12:54:28 -07:00
Jan Blunck
3773dc9205 [PATCH] fix hrtimer percpu usage typo
The percpu variable is used incorrectly in switch_hrtimer_base().

Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14 12:54:28 -07:00
Thomas Gleixner
ce2c6b5384 [PATCH] futex: Apply recent futex fixes to futex_compat
The recent fixups in futex.c need to be applied to futex_compat.c too.  Fixes
a hang reported by Olaf.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:49 -07:00
KAMEZAWA Hiroyuki
58c1b5b079 [PATCH] memory hotadd fixes: find_next_system_ram catch range fix
find_next_system_ram() is used to find available memory resource at onlining
newly added memory.  This patch fixes following problem.

find_next_system_ram() cannot catch this case.

Resource:      (start)-------------(end)
Section :                (start)-------------(end)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Keith Mannthey <kmannth@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:48 -07:00
KAMEZAWA Hiroyuki
0f04ab5efb [PATCH] memory hotadd fixes: change find_next_system_ram's return value manner
find_next_system_ram() returns valid memory range which meets requested area,
only used by memory-hot-add.

This function always rewrite requested resource even if returned area is not
fully fit in requested one.  And sometimes the returnd resource is larger than
requested area.  This annoyes the caller.  This patch changes the returned
value to fit in requested area.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Keith Mannthey <kmannth@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:48 -07:00
Antonino A. Daplas
78944e549d [PATCH] vt: printk: Fix framebuffer console triggering might_sleep assertion
Reported by: Dave Jones

Whilst printk'ing to both console and serial console, I got this...
(2.6.18rc1)

BUG: sleeping function called from invalid context at kernel/sched.c:4438
in_atomic():0, irqs_disabled():1

Call Trace:
 [<ffffffff80271db8>] show_trace+0xaa/0x23d
 [<ffffffff80271f60>] dump_stack+0x15/0x17
 [<ffffffff8020b9f8>] __might_sleep+0xb2/0xb4
 [<ffffffff8029232e>] __cond_resched+0x15/0x55
 [<ffffffff80267eb8>] cond_resched+0x3b/0x42
 [<ffffffff80268c64>] console_conditional_schedule+0x12/0x14
 [<ffffffff80368159>] fbcon_redraw+0xf6/0x160
 [<ffffffff80369c58>] fbcon_scroll+0x5d9/0xb52
 [<ffffffff803a43c4>] scrup+0x6b/0xd6
 [<ffffffff803a4453>] lf+0x24/0x44
 [<ffffffff803a7ff8>] vt_console_print+0x166/0x23d
 [<ffffffff80295528>] __call_console_drivers+0x65/0x76
 [<ffffffff80295597>] _call_console_drivers+0x5e/0x62
 [<ffffffff80217e3f>] release_console_sem+0x14b/0x232
 [<ffffffff8036acd6>] fb_flashcursor+0x279/0x2a6
 [<ffffffff80251e3f>] run_workqueue+0xa8/0xfb
 [<ffffffff8024e5e0>] worker_thread+0xef/0x122
 [<ffffffff8023660f>] kthread+0x100/0x136
 [<ffffffff8026419e>] child_rip+0x8/0x12

This can occur when release_console_sem() is called but the log
buffer still has contents that need to be flushed. The console drivers
are called while the console_may_schedule flag is still true. The
might_sleep() is triggered when fbcon calls console_conditional_schedule().

Fix by setting console_may_schedule to zero earlier, before the call to the
console drivers.

Signed-off-by: Antonino Daplas <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:47 -07:00
Chuck Ebbert
9f59ce5d0e [PATCH] ptrace: make pid of child process available for PTRACE_EVENT_VFORK_DONE
When delivering PTRACE_EVENT_VFORK_DONE, provide pid of the child process
when tracer calls ptrace(PTRACE_GETEVENTMSG).  This is already
(accidentally) available when the tracer is tracing VFORK in addition to
VFORK_DONE.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Daniel Jacobowitz <dan@debian.org>
Cc: Albert Cahalan <acahalan@gmail.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:46 -07:00
Christian Borntraeger
e91467ecd1 [PATCH] bug in futex unqueue_me
This patch adds a barrier() in futex unqueue_me to avoid aliasing of two
pointers.

On my s390x system I saw the following oops:

Unable to handle kernel pointer dereference at virtual kernel address
0000000000000000
Oops: 0004 [#1]
CPU:    0    Not tainted
Process mytool (pid: 13613, task: 000000003ecb6ac0, ksp: 00000000366bdbd8)
Krnl PSW : 0704d00180000000 00000000003c9ac2 (_spin_lock+0xe/0x30)
Krnl GPRS: 00000000ffffffff 000000003ecb6ac0 0000000000000000 0700000000000000
           0000000000000000 0000000000000000 000001fe00002028 00000000000c091f
           000001fe00002054 000001fe00002054 0000000000000000 00000000366bddc0
           00000000005ef8c0 00000000003d00e8 0000000000144f91 00000000366bdcb8
Krnl Code: ba 4e 20 00 12 44 b9 16 00 3e a7 84 00 08 e3 e0 f0 88 00 04
Call Trace:
([<0000000000144f90>] unqueue_me+0x40/0xe4)
 [<0000000000145a0c>] do_futex+0x33c/0xc40
 [<000000000014643e>] sys_futex+0x12e/0x144
 [<000000000010bb00>] sysc_noemu+0x10/0x16
 [<000002000003741c>] 0x2000003741c

The code in question is:

static int unqueue_me(struct futex_q *q)
{
        int ret = 0;
        spinlock_t *lock_ptr;

        /* In the common case we don't take the spinlock, which is nice. */
 retry:
        lock_ptr = q->lock_ptr;
        if (lock_ptr != 0) {
                spin_lock(lock_ptr);
		/*
                 * q->lock_ptr can change between reading it and
                 * spin_lock(), causing us to take the wrong lock.  This
                 * corrects the race condition.
[...]

and my compiler (gcc 4.1.0) makes the following out of it:

00000000000003c8 <unqueue_me>:
     3c8:       eb bf f0 70 00 24       stmg    %r11,%r15,112(%r15)
     3ce:       c0 d0 00 00 00 00       larl    %r13,3ce <unqueue_me+0x6>
                        3d0: R_390_PC32DBL      .rodata+0x2a
     3d4:       a7 f1 1e 00             tml     %r15,7680
     3d8:       a7 84 00 01             je      3da <unqueue_me+0x12>
     3dc:       b9 04 00 ef             lgr     %r14,%r15
     3e0:       a7 fb ff d0             aghi    %r15,-48
     3e4:       b9 04 00 b2             lgr     %r11,%r2
     3e8:       e3 e0 f0 98 00 24       stg     %r14,152(%r15)
     3ee:       e3 c0 b0 28 00 04       lg      %r12,40(%r11)
		/* write q->lock_ptr in r12 */
     3f4:       b9 02 00 cc             ltgr    %r12,%r12
     3f8:       a7 84 00 4b             je      48e <unqueue_me+0xc6>
		/* if r12 is zero then jump over the code.... */
     3fc:       e3 20 b0 28 00 04       lg      %r2,40(%r11)
		/* write q->lock_ptr in r2 */
     402:       c0 e5 00 00 00 00       brasl   %r14,402 <unqueue_me+0x3a>
                        404: R_390_PC32DBL      _spin_lock+0x2
		/* use r2 as parameter for spin_lock */

So the code becomes more or less:
if (q->lock_ptr != 0) spin_lock(q->lock_ptr)
instead of
if (lock_ptr != 0) spin_lock(lock_ptr)

Which caused the oops from above.
After adding a barrier gcc creates code without this problem:
[...] (the same)
     3ee:       e3 c0 b0 28 00 04       lg      %r12,40(%r11)
     3f4:       b9 02 00 cc             ltgr    %r12,%r12
     3f8:       b9 04 00 2c             lgr     %r2,%r12
     3fc:       a7 84 00 48             je      48c <unqueue_me+0xc4>
     400:       c0 e5 00 00 00 00       brasl   %r14,400 <unqueue_me+0x38>
                        402: R_390_PC32DBL      _spin_lock+0x2

As a general note, this code of unqueue_me seems a bit fishy. The retry logic
of unqueue_me only works if we can guarantee, that the original value of
q->lock_ptr is always a spinlock (Otherwise we overwrite kernel memory). We
know that q->lock_ptr can change. I dont know what happens with the original
spinlock, as I am not an expert with the futex code.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@timesys.com>
Signed-off-by: Christian Borntraeger <borntrae@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:46 -07:00
Rafael J. Wysocki
a7ef7878ea [PATCH] Make suspend possible with a traced process at a breakpoint
It should be possible to suspend, either to RAM or to disk, if there's a
traced process that has just reached a breakpoint.  However, this is a
special case, because its parent process might have been frozen already and
then we are unable to deliver the "freeze" signal to the traced process.
If this happens, it's better to cancel the freezing of the traced process.

Ref. http://bugzilla.kernel.org/show_bug.cgi?id=6787

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:45 -07:00
Al Viro
3f2792ffbd [PATCH] take filling ->pid, etc. out of audit_get_context()
move that stuff downstream and into the only branch where it'll be
used.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:59:51 -04:00
Al Viro
5ac3a9c26c [PATCH] don't bother with aux entires for dummy context
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:59:42 -04:00
Al Viro
d51374adf5 [PATCH] mark context of syscall entered with no rules as dummy
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:59:26 -04:00
Al Viro
471a5c7c83 [PATCH] introduce audit rules counter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:55:18 -04:00
Amy Griffis
5422e01ac1 [PATCH] fix audit oops with invalid operator
Michael C Thompson wrote:  [Tue Aug 01 2006, 02:36:36PM EDT]
> The trigger for this oops is:
> # auditctl -a exit,always -S pread64 -F 'inode<1'

Setting the err value will fix it.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:54:43 -04:00
Amy Griffis
6988434ee5 [PATCH] fix oops with CONFIG_AUDIT and !CONFIG_AUDITSYSCALL
Always initialize the audit_inode_hash[] so we don't oops on list rules.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:50:39 -04:00
Amy Griffis
73d3ec5aba [PATCH] fix missed create event for directory audit
When an object is created via a symlink into an audited directory, audit misses
the event due to not having collected the inode data for the directory.  Modify
__audit_inode_child() to copy the parent inode data if a parent wasn't found in
audit_names[].

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:50:30 -04:00
Amy Griffis
3e2efce067 [PATCH] fix faulty inode data collection for open() with O_CREAT
When the specified path is an existing file or when it is a symlink, audit
collects the wrong inode number, which causes it to miss the open() event.
Adding a second hook to the open() path fixes this.

Also add audit_copy_inode() to consolidate some code.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:50:21 -04:00
Linus Torvalds
ae74c3b69a Fix force_sig_info() semantics after cleanups
Suresh points out that commit b0423a0d9c
broke the semantics of a synchronous signal like SIGSEGV occurring
recursively inside its own handler handler (or, indeed, any other
context when the signal was blocked).

That was unintentional, and this fixes things up by reinstating the old
semantics, but without reverting the cleanups.

Cc: Paul E. McKenney <paulmck@us.ibm.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-02 20:17:49 -07:00
Josh Triplett
51d8c5edd3 [PATCH] timer: Fix tvec_bases initializer
kernel/timer.c defines a (per-cpu) pointer to tvec_base_t, but initializes
it using { &a_tvec_base_t }, which sparse warns about; change this to just
&a_tvec_base_t.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:44 -07:00
Steven Rostedt
d07fe82c24 [PATCH] reference rt-mutex-design in rtmutex.c
In order to prevent Doc Rot, this patch adds a reference to the design
document for rtmutex.c in rtmutex.c.  So when someone needs to update or
change the design of that file they will know that a document actually
exists that explains the design (helping them change it), and hopefully
that they will update the document if they too change the design.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:43 -07:00
Tim Chen
3c829c367a [PATCH] Reducing local_bh_enable/disable overhead in irqtrace
The recent changes from irqtrace feature has added overheads to
local_bh_disable and local_bh_enable that reduces UDP performance across
x86_64 and IA64, even though IA64 does not support the irqtrace feature.
Patch in question is

[PATCH]lockdep: irqtrace subsystem, core
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=c
ommit;h=de30a2b355ea85350ca2f58f3b9bf4e5bc007986

Prior to this patch, local_bh_disable was a short macro.  Now it is a
function which calls __local_bh_disable with added irq flags save and
restore.  The irq flags save and restore were also added to
local_bh_enable, probably for injecting the trace irqs code.

This overhead is on the generic code path across all architectures.  On a
IA_64 test machine (Itanium-2 1.6 GHz) running a benchmark like netperf's
UDP streaming test, the added overhead results in a drop of 3% in
throughput, as udp_sendmsg calls the local_bh_enable/disable several times.

Other workloads that have heavy usages of local_bh_enable/disable could
also be affected.  The patch ideally should not have affected IA-64
performance as it does not have IRQ tracing support.  A significant portion
of the overhead is in the added irq flags save and restore, which I think
is not needed if IRQ tracing is unused.  A suggested patch is attached
below that recovers the lost performance.  However, the "ifdef"s in the
patch are a bit ugly.

Signed-off-by: Tim Chen <tim.c.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:42 -07:00
Heiko Carstens
b50f60ceee [PATCH] pi-futex: missing pi_waiters plist initialization
Initialize init task's pi_waiters plist.  Otherwise cpu hotplug of cpu 0
might crash, since rt_mutex_getprio() accesses an uninitialized list head.

call chain which led to crash:

take_cpu_down
sched_idle_next
__setscheduler
rt_mutex_getprio

Using PLIST_HEAD_INIT in the INIT_TASK macro doesn't work unfortunately,
since the pi_waiters member is only conditionally present.

Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:41 -07:00
Rolf Eike Beer
0fcb78c22f [PATCH] Add DocBook documentation for workqueue functions
kernel/workqueue.c was omitted from generating kernel documentation.  This
adds a new section "Workqueues and Kevents" and adds documentation for some
of the functions.

Some functions in this file already had DocBook-style comments, now they
finally become visible.

Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:40 -07:00
Jim Houston
2d7d253548 [PATCH] fix cond_resched() fix
In cond_resched_lock() it calls __resched_legal() before dropping the spin
lock.  __resched_legal() will always finds the preempt_count non-zero and
will prevent the call to __cond_resched().

The attached patch adds a parameter to __resched_legal() with the expected
preempt_count value.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:40 -07:00
Steven Rostedt
6ea24f9ad1 [PATCH] fix bad macro param in timer.c
We have

#define INDEX(N) (base->timer_jiffies >> (TVR_BITS + N * TVN_BITS)) & TVN_MASK

and it's used via

	list = varray[i + 1]->vec + (INDEX(i + 1));

So, due to underparenthesisation, this INDEX(i+1) is now a ...  (TVR_BITS + i
+ 1 * TVN_BITS)) ...

So this bugfix changes behaviour.  It worked before by sheer luck:

  "If i was anything but 0, it was broken.  But this was only used by
   s390 and arm.  Since it was for the next interrupt, could that next
   interrupt be a problem (going into the second cascade)? But it was
   probably seldom wrong.  That is, this would fail if the next
   interrupt was in the second cascade, and was wrapped.  Which may
   never of happened.  Also if it did happen, it would have just missed
   the interrupt.

   If an interrupt was missed, and no one was there to miss it, was it
   really missed :-)"

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:40 -07:00
Chandra Seetharaman
8c78f3075d [PATCH] cpu hotplug: replace __devinit* with __cpuinit* for cpu notifications
Few of the callback functions and notifier blocks that are associated with cpu
notifications incorrectly have __devinit and __devinitdata.  They should be
__cpuinit and __cpuinitdata instead.

It makes no functional difference but wastes text area when CONFIG_HOTPLUG is
enabled and CONFIG_HOTPLUG_CPU is not.

This patch fixes all those instances.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:39 -07:00
bibo, mao
a9ad965ea9 [PATCH] IA64: kprobe invalidate icache of jump buffer
Kprobe inserts breakpoint instruction in probepoint and then jumps to
instruction slot when breakpoint is hit, the instruction slot icache must
be consistent with dcache.  Here is the patch which invalidates instruction
slot icache area.

Without this patch, in some machines there will be fault when executing
instruction slot where icache content is inconsistent with dcache.

Signed-off-by: bibo,mao <bibo.mao@intel.com>
Acked-by: "Luck, Tony" <tony.luck@intel.com>
Acked-by: Keshavamurthy Anil S <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:38 -07:00
Shailabh Nagar
163ecdff06 [PATCH] delay accounting: temporarily enable by default
Enable delay accounting by default so that feature gets coverage testing
without requiring special measures.

Earlier, it was off by default and had to be enabled via a boot time param.
 This patch reverses the default behaviour to improve coverage testing.  It
can be removed late in the kernel development cycle if its believed users
shouldn't have to incur any cost if they don't want delay accounting.  Or
it can be retained forever if the utility of the stats is deemed common
enough to warrant keeping the feature on.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:37 -07:00
Shailabh Nagar
d94a041519 [PATCH] taskstats: free skb, avoid returns in send_cpu_listeners
Add a missing freeing of skb in the case there are no listeners at all.
Also remove the returning of error values by the function as it is unused
by the sole caller.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:37 -07:00
Shailabh Nagar
7d94dddd43 [PATCH] make taskstats sending completely independent of delay accounting on/off status
Complete the separation of delay accounting and taskstats by ignoring the
return value of delay accounting functions that fill in parts of taskstats
before it is sent out (either in response to a command or as part of a task
exit).

Also make delayacct_add_tsk return silently when delay accounting is turned
off rather than treat it as an error.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:37 -07:00
David Brownell
15a647eba9 [PATCH] genirq: {en,dis}able_irq_wake() need refcounting too
IRQs need refcounting and a state flag to track whether the the IRQ should
be enabled or disabled as a "normal IRQ" source after a series of calls to
{en,dis}able_irq().  For shared IRQs, the IRQ must be enabled so long as at
least one driver needs it active.

Likewise, IRQs need the same support to track whether the IRQ should be
enabled or disabled as a "wakeup event" source after a series of calls to
{en,dis}able_irq_wake().  For shared IRQs, the IRQ must be enabled as a
wakeup source during sleep so long as at least one driver needs it.  But
right now they _don't have_ that refcounting ...  which means sharing a
wakeup-capable IRQ can't work correctly in some configurations.

This patch adds the refcount and flag mechanisms to set_irq_wake() -- which
is what {en,dis}able_irq_wake() call -- and minimal documentation of what
the irq wake mechanism does.

Drivers relying on the older (broken) "toggle" semantics will trigger a
warning; that'll be a handful of drivers on ARM systems.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:36 -07:00
Siddha, Suresh B
f712c0c7e1 [PATCH] sched: build_sched_domains() fix
Use the correct groups while initializing sched groups power for
allnodes_domain.  This fixes the crash observed while creating exclusive
cpusets.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Reported-and-tested-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:36 -07:00
Ingo Molnar
e3f2ddeac7 [PATCH] pi-futex: robust-futex exit
Fix robust PI-futexes to be properly unlocked on unexpected exit.

For this to work the kernel has to know whether a futex is a PI or a
non-PI one, because the semantics are different.  Since the space in
relevant glibc data structures is extremely scarce, the best solution is
to encode the 'PI' information in bit 0 of the robust list pointer.
Existing (non-PI) glibc robust futexes have this bit always zero, so the
ABI is kept.  New glibc with PI-robust-futexes will set this bit.

Further fixes from Thomas Gleixner <tglx@linutronix.de>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-28 21:02:00 -07:00
Ingo Molnar
627371d73c [PATCH] pi-futex: robust-futex exit crash fix
Fix pi_state->list handling bugs: list handling mishap, locking error.
Plus add more debug checks and fix a few style issues i noticed while
debugging this.

(reported by Ulrich Drepper and Jakub Jelinek.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-28 21:02:00 -07:00
Paul Jackson
abb5a5cc6b [PATCH] Cpuset: fix ABBA deadlock with cpu hotplug lock
Fix ABBA deadlock between lock_cpu_hotplug() and the cpuset
callback_mutex lock.

It only happens on cpu_exclusive cpusets, due to the dynamic
sched domain code trying to take the cpu hotplug lock inside
the cpuset callback_mutex lock.

This bug has apparently been here for several months, but didn't
get hit until the right customer load on a large system.

This fix appears right from inspection, but it will take a few
more days running it on that customers workload to be confident
we nailed it.  We don't have any other reproducible test case.

The cpu_hotplug_lock() tends to cover large runs of code.
The other places that hold both that lock and the cpuset callback
mutex lock always nest the cpuset lock inside the hotplug lock.
This place tries to do the reverse, risking an ABBA deadlock.

This is in the cpuset_rmdir() code, where we:
  * take the callback_mutex lock
  * mark the cpuset CS_REMOVED
  * call update_cpu_domains for cpu_exclusive cpusets
  * in that call, take the cpu_hotplug lock if the
    cpuset is marked for removal.

Thanks to Jack Steiner for identifying this deadlock.

The fix is to tear down the dynamic sched domain before we grab
the cpuset callback_mutex lock.  This way, the two locks are
serialized, with the hotplug lock taken and released before
trying for the cpuset lock.

I suspect that this bug was introduced when I changed the
cpuset locking from one lock to two.  The dynamic sched domain
dependency on cpu_exclusive cpusets and its hotplug hooks were
added to this code earlier, when cpusets had only a single lock.
It may well have been fine then.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-23 13:03:05 -07:00
Linus Torvalds
aa95387774 cpu hotplug: simplify and hopefully fix locking
The CPU hotplug locking was quite messy, with a recursive lock to
handle the fact that both the actual up/down sequence wanted to
protect itself from being re-entered, but the callbacks that it
called also tended to want to protect themselves from CPU events.

This splits the lock into two (one to serialize the whole hotplug
sequence, the other to protect against the CPU present bitmaps
changing). The latter still allows recursive usage because some
subsystems (ondemand policy for cpufreq at least) had already gotten
too used to the lax locking, but the locking mistakes are hopefully
now less fundamental, and we now warn about recursive lock usage
when we see it, in the hope that it can be fixed.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-23 12:12:16 -07:00
Shailabh Nagar
bb129994c3 [PATCH] Remove down_write() from taskstats code invoked on the exit() path
In send_cpu_listeners(), which is called on the exit path, a down_write()
was protecting operations like skb_clone() and genlmsg_unicast() that do
GFP_KERNEL allocations.  If the oom-killer decides to kill tasks to satisfy
the allocations,the exit of those tasks could block on the same semphore.

The down_write() was only needed to allow removal of invalid listeners from
the listener list.  The patch converts the down_write to a down_read and
defers the removal to a separate critical region.  This ensures that even
if the oom-killer is called, no other task's exit is blocked as it can
still acquire another down_read.

Thanks to Andrew Morton & Herbert Xu for pointing out the oom related
pitfalls, and to Chandra Seetharaman for suggesting this fix instead of
using something more complex like RCU.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:57 -07:00
Shailabh Nagar
f9fd8914c1 [PATCH] per-task delay accounting taskstats interface: control exit data through cpumasks
On systems with a large number of cpus, with even a modest rate of tasks
exiting per cpu, the volume of taskstats data sent on thread exit can
overflow a userspace listener's buffers.

One approach to avoiding overflow is to allow listeners to get data for a
limited and specific set of cpus.  By scaling the number of listeners
and/or the cpus they monitor, userspace can handle the statistical data
overload more gracefully.

In this patch, each listener registers to listen to a specific set of cpus
by specifying a cpumask.  The interest is recorded per-cpu.  When a task
exits on a cpu, its taskstats data is unicast to each listener interested
in that cpu.

Thanks to Andrew Morton for pointing out the various scalability and
general concerns of previous attempts and for suggesting this design.

[akpm@osdl.org: build fix]
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:57 -07:00
Shailabh Nagar
ad4ecbcba7 [PATCH] delay accounting taskstats interface send tgid once
Send per-tgid data only once during exit of a thread group instead of once
with each member thread exit.

Currently, when a thread exits, besides its per-tid data, the per-tgid data
of its thread group is also sent out, if its thread group is non-empty.
The per-tgid data sent consists of the sum of per-tid stats for all
*remaining* threads of the thread group.

This patch modifies this sending in two ways:

- the per-tgid data is sent only when the last thread of a thread group
  exits.  This cuts down heavily on the overhead of sending/receiving
  per-tgid data, especially when other exploiters of the taskstats
  interface aren't interested in per-tgid stats

- the semantics of the per-tgid data sent are changed.  Instead of being
  the sum of per-tid data for remaining threads, the value now sent is the
  true total accumalated statistics for all threads that are/were part of
  the thread group.

The patch also addresses a minor issue where failure of one accounting
subsystem to fill in the taskstats structure was causing the send of
taskstats to not be sent at all.

The patch has been tested for stability and run cerberus for over 4 hours
on an SMP.

[akpm@osdl.org: bugfixes]
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:57 -07:00
Shailabh Nagar
2589045466 [PATCH] per-task-delay-accounting: /proc export of aggregated block I/O delays
Export I/O delays seen by a task through /proc/<tgid>/stats for use in top
etc.

Note that delays for I/O done for swapping in pages (swapin I/O) is clubbed
together with all other I/O here (this is not the case in the netlink
interface where the swapin I/O is kept distinct)

[akpm@osdl.org: printk warning fix]
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:57 -07:00
Shailabh Nagar
6f44993fe1 [PATCH] per-task-delay-accounting: delay accounting usage of taskstats interface
Usage of taskstats interface by delay accounting.

Signed-off-by: Shailabh Nagar <nagar@us.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:56 -07:00
Shailabh Nagar
c757249af1 [PATCH] per-task-delay-accounting: taskstats interface
Create a "taskstats" interface based on generic netlink (NETLINK_GENERIC
family), for getting statistics of tasks and thread groups during their
lifetime and when they exit.  The interface is intended for use by multiple
accounting packages though it is being created in the context of delay
accounting.

This patch creates the interface without populating the fields of the data
that is sent to the user in response to a command or upon the exit of a task.
Each accounting package interested in using taskstats has to provide an
additional patch to add its stats to the common structure.

[akpm@osdl.org: cleanups, Kconfig fix]
Signed-off-by: Shailabh Nagar <nagar@us.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:56 -07:00
Chandra Seetharaman
52f17b6c2b [PATCH] per-task-delay-accounting: cpu delay collection via schedstats
Make the task-related schedstats functions callable by delay accounting even
if schedstats collection isn't turned on.  This removes the dependency of
delay accounting on schedstats.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:56 -07:00
Shailabh Nagar
0ff922452d [PATCH] per-task-delay-accounting: sync block I/O and swapin delay collection
Unlike earlier iterations of the delay accounting patches, now delays are only
collected for the actual I/O waits rather than try and cover the delays seen
in I/O submission paths.

Account separately for block I/O delays incurred as a result of swapin page
faults whose frequency can be affected by the task/process' rss limit.  Hence
swapin delays can act as feedback for rss limit changes independent of I/O
priority changes.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:56 -07:00
Shailabh Nagar
ca74e92b46 [PATCH] per-task-delay-accounting: setup
Initialization code related to collection of per-task "delay" statistics which
measure how long it had to wait for cpu, sync block io, swapping etc.  The
collection of statistics and the interface are in other patches.  This patch
sets up the data structures and allows the statistics collection to be
disabled through a kernel boot parameter.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:56 -07:00
Ingo Molnar
3a5f5e488c [PATCH] lockdep: core, fix rq-lock handling on __ARCH_WANT_UNLOCKED_CTXSW
On platforms that have __ARCH_WANT_UNLOCKED_CTXSW set and want to implement
lock validator support there's a bug in rq->lock handling: in this case we
dont 'carry over' the runqueue lock into another task - but still we did a
spinlock_release() of it.  Fix this by making the spinlock_release() in
context_switch() dependent on !__ARCH_WANT_UNLOCKED_CTXSW.

(Reported by Ralf Baechle on MIPS, which has __ARCH_WANT_UNLOCKED_CTXSW.
This fixes a lockdep-internal BUG message on such platforms.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:55 -07:00
OGAWA Hirofumi
a4afee02a5 [PATCH] Fix sighand->siglock usage in kernel/acct.c
IRQs must be disabled before taking ->siglock.

Noticed by lockdep.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:54 -07:00
john stultz
3e143475c2 [PATCH] improve timekeeping resume robustness
Resolve problems seen w/ APM suspend.

Due to resume initialization ordering, its possible we could get a timer
interrupt before the timekeeping resume() function is called.  This patch
ensures we don't do any timekeeping accounting before we're fully resumed.

(akpm: fixes the machine-freezes-on-APM-resume bug)

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:54 -07:00
Adrian Bunk
6bc02d8412 [PATCH] unexport open_softirq
Christoph Hellwig:
open_softirq just enables a softirq.  The softirq array is statically
allocated so to add a new one you would have to patch the kernel.  So
there's no point to keep this export at all as any user would have to
patch the enum in include/linux/interrupt.h anyway.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:53 -07:00
Luca Tettamanti
60198f9992 [PATCH] Add try_to_freeze() to rt-test kthreads
When CONFIG_RT_MUTEX_TESTER is enabled kernel refuses to suspend the
machine because it's unable to freeze the rt-test-* threads.

Add try_to_freeze() after schedule() so that the threads will be freezed
correctly; I've tested the patch and it lets the notebook suspends and
resumes nicely.

Signed-off-by: Luca Tettamanti <kronos.it@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:53 -07:00
Andrew Morton
a0009652af [PATCH] del_timer_sync(): add cpu_relax()
Relax the CPU in the del_timer_sync() busywait loop.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:52 -07:00
Adrian Bunk
52e92e5788 [PATCH] remove kernel/kthread.c:kthread_stop_sem()
Remove the now-unneeded kthread_stop_sem().

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:52 -07:00
Andreas Gruenbacher
098c5eea03 [PATCH] null-terminate over-long /proc/kallsyms symbols
Got a customer bug report (https://bugzilla.novell.com/190296) about kernel
symbols longer than 127 characters which end up in a string buffer that is
not NULL terminated, leading to garbage in /proc/kallsyms.  Using strlcpy
prevents this from happening, even though such symbols still won't come out
right.

A better fix would be to not use a fixed-size buffer, but it's probably not
worth the trouble.  (Modversion'ed symbols even have a length limit of 60.)

[bunk@stusta.de: build fix]
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:52 -07:00
Adrian Bunk
cd6ef2ada5 [PATCH] The scheduled unexport of insert_resource
Implement the scheduled unexport of insert_resource.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-07-12 16:09:08 -07:00
Adrian Bunk
26865e9c26 [PATCH] remove kernel/power/pm.c:pm_unregister_all()
Remove the deprecated and no longer used pm_unregister_all().

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-07-12 16:09:08 -07:00
Marcel Holtmann
abf75a5033 [PATCH] Fix prctl privilege escalation and suid_dumpable (CVE-2006-2451)
Based on a patch from Ernie Petrides

During security research, Red Hat discovered a behavioral flaw in core
dump handling. A local user could create a program that would cause a
core file to be dumped into a directory they would not normally have
permissions to write to. This could lead to a denial of service (disk
consumption), or allow the local user to gain root privileges.

The prctl() system call should never allow to set "dumpable" to the
value 2. Especially not for non-privileged users.

This can be split into three cases:

  1) running as root -- then core dumps will already be done as root,
     and so prctl(PR_SET_DUMPABLE, 2) is not useful

  2) running as non-root w/setuid-to-root -- this is the debatable case

  3) running as non-root w/setuid-to-non-root -- then you definitely
     do NOT want "dumpable" to get set to 2 because you have the
     privilege escalation vulnerability

With case #2, the only potential usefulness is for a program that has
designed to run with higher privilege (than the user invoking it) that
wants to be able to create root-owned root-validated core dumps. This
might be useful as a debugging aid, but would only be safe if the program
had done a chdir() to a safe directory.

There is no benefit to a production setuid-to-root utility, because it
shouldn't be dumping core in the first place. If this is true, then the
same debugging aid could also be accomplished with the "suid_dumpable"
sysctl.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-12 12:50:25 -07:00
Arjan van de Ven
2c16e9c888 [PATCH] lockdep: disable lock debugging when kernel state becomes untrusted
Disable lockdep debugging in two situations where the integrity of the
kernel no longer is guaranteed: when oopsing and when hitting a
tainting-condition.  The goal is to not get weird lockdep traces that don't
make sense or are otherwise undebuggable, to not waste time.

Lockdep assumes that the previous state it knows about is valid to operate,
which is why lockdep turns itself off after the first violation it reports,
after that point it can no longer make that assumption.

A kernel oops means that the integrity of the kernel compromised; in
addition anything lockdep would report is of lesser importance than the
oops.

All the tainting conditions are of similar integrity-violating nature and
also make debugging/diagnosing more difficult.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:27 -07:00
Christoph Hellwig
c59923a15c [PATCH] remove the tasklist_lock export
As announced half a year ago this patch will remove the tasklist_lock
export.  The previous two patches got rid of the remaining modular users.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:26 -07:00
Ingo Molnar
21d71f513b [PATCH] uninline init_waitqueue_head()
allyesconfig vmlinux size delta:

  text            data    bss     dec          filename
  20736884        6073834 3075176 29885894     vmlinux.before
  20721009        6073966 3075176 29870151     vmlinux.after

~18 bytes per callsite, 15K of text size (~0.1%) saved.

(as an added bonus this also removes a lockdep annotation.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:25 -07:00
Linus Torvalds
aeceb15738 [PATCH] swsusp: fix panic when signature can't be read
Do not panic a machine when swsusp signature can't be read.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:22 -07:00
Andrew Morton
712f403af6 [PATCH] swsusp warning fix
kernel/power/swap.c: In function 'swsusp_write':
kernel/power/swap.c:275: warning: 'start' may be used uninitialized in this function

gcc isn't smart enough, so help it.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:22 -07:00
Rafael J. Wysocki
95018f7c94 [PATCH] swsusp: do not use memcpy for snapshotting memory
swsusp should not use memcpy for snapshotting memory, because on some
architectures memcpy may increase preempt_count (i386 does this when
CONFIG_X86_USE_3DNOW is set).  Then, as a result, wrong value of preempt_count
is stored in the image.

Replace memcpy in copy_data_pages with an open-coded loop.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:22 -07:00
Roman Zippel
e154ff3d2c [PATCH] adjust clock for lost ticks
A large number of lost ticks can cause an overadjustment of the clock.  To
compensate for this we look at the current error and the larger the error
already is the more careful we are at adjusting the error.  As small extra
fix reset the error when the clock is set.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: john stultz <johnstul@us.ibm.com>
Cc: Uwe Bugla <uwe.bugla@gmx.de>
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:18 -07:00
Thomas Gleixner
06a9ec291b [PATCH] pi-futex: Validate futex type instead of oopsing
Calling futex_lock_pi is called with a reference to a non PI futex and
waiters exist already, lookup_pi_state() oopses due to pi_state == NULL.
Check this condition and return -EINVAL to userspace.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:18 -07:00
Adrian Bunk
80d6679a62 [PATCH] kernel/softirq.c: EXPORT_UNUSED_SYMBOL
This patch marks an unused export as EXPORT_UNUSED_SYMBOL.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:18 -07:00
Adrian Bunk
c0fc84d2e5 [PATCH] kernel/printk.c: EXPORT_SYMBOL_UNUSED
This patch marks unused exports as EXPORT_SYMBOL_UNUSED.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:17 -07:00
Ingo Molnar
d6d897cec2 [PATCH] lockdep: core, reduce per-lock class-cache size
lockdep_map is embedded into every lock, which blows up data structure
sizes all around the kernel.  Reduce the class-cache to be for the default
class only - that is used in 99.9% of the cases and even if we dont have a
class cached, the lookup in the class-hash is lockless.

This change reduces the per-lock dep_map overhead by 56 bytes on 64-bit
platforms and by 28 bytes on 32-bit platforms.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:14 -07:00
Arjan van de Ven
55794a412f [PATCH] lockdep: improve debug output
Make lockdep print which lock is held, in the "kfree() of a live lock"
scenario.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:14 -07:00
Andi Kleen
f9829cceb6 [PATCH] Minor cleanup to lockdep.c
- Use printk formatting for indentation
- Don't leave NTFS in the default event filter

Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:14 -07:00
Andreas Mohr
2ed6e34f88 [PATCH] small kernel/sched.c cleanup
- constify and optimize stat_nam (thanks to Michael Tokarev!)
- spelling and comment fixes

Signed-off-by: Andreas Mohr <andi@lisas.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:13 -07:00
Peter Williams
0a565f7919 [PATCH] sched: fix bug in __migrate_task()
Problem:

In the function __migrate_task(), deactivate_task() followed by
activate_task() is used to move the task from one run queue to
another.  This has two undesirable effects:

1. The task's priority is recalculated. (Nowhere else in the
scheduler code is the priority recalculated for a change of CPU.)

2. The task's time stamp is set to the current time.  At the very least,
this makes the adjustment of the time stamp before the call to
deactivate_task() redundant but I believe the problem is more serious
as the time stamp now holds the time of the queue change instead of
the time at which the task was woken.  In addition, unless dest_rq is
the same queue as "current" is on the time stamp could be inaccurate
due to inter CPU drift.

Solution:

Replace the call to activate_task() with one to __activate_task().

Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:13 -07:00
Linus Torvalds
ca78f6baca Merge master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq
* master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq:
  Move workqueue exports to where the functions are defined.
  [CPUFREQ] Misc cleanups in ondemand.
  [CPUFREQ] Make ondemand sampling per CPU and remove the mutex usage in sampling path.
  [CPUFREQ] Add queue_delayed_work_on() interface for workqueues.
  [CPUFREQ] Remove slowdown from ondemand sampling path.
2006-07-04 14:00:26 -07:00
Andrew Morton
d8cb7c1ded [PATCH] revert "kthread: convert stop_machine into a kthread"
Jiri reports that the stop_machin kthread conversion caused his machine to
hang when suspending.  Hyperthreading is apparently involved.

I don't see why that would be and I can't reproduce it.  Revert to the 2.6.17
code.

Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 21:25:20 -07:00
Linus Torvalds
912b2539e1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
  powerpc: add defconfig for Freescale MPC8349E-mITX board
  powerpc: Add base support for the Freescale MPC8349E-mITX eval board
  Documentation: correct values in MPC8548E SEC example node
  [POWERPC] Actually copy over i8259.c to arch/ppc/syslib this time
  [POWERPC] Add new interrupt mapping core and change platforms to use it
  [POWERPC] Copy i8259 code back to arch/ppc
  [POWERPC] New device-tree interrupt parsing code
  [POWERPC] Use the genirq framework
  [PATCH] genirq: Allow fasteoi handler to retrigger disabled interrupts
  [POWERPC] Update the SWIM3 (powermac) floppy driver
  [POWERPC] Fix error handling in detecting legacy serial ports
  [POWERPC] Fix booting on Momentum "Apache" board (a Maple derivative)
  [POWERPC] Fix various offb and BootX-related issues
  [POWERPC] Add a default config for 32-bit CHRP machines
  [POWERPC] fix implicit declaration on cell.
  [POWERPC] change get_property to return void *
2006-07-03 15:28:34 -07:00
Ingo Molnar
70b97a7f0b [PATCH] sched: cleanup, convert sched.c-internal typedefs to struct
convert:

 - runqueue_t to 'struct rq'
 - prio_array_t to 'struct prio_array'
 - migration_req_t to 'struct migration_req'

I was the one who added these but they are both against the kernel coding
style and also were used inconsistently at places.  So just get rid of them at
once, now that we are flushing the scheduler patch-queue anyway.

Conversion was mostly scripted, the result was reviewed and all secondary
whitespace and style impact (if any) was fixed up by hand.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:11 -07:00
Ingo Molnar
36c8b58689 [PATCH] sched: cleanup, remove task_t, convert to struct task_struct
cleanup: remove task_t and convert all the uses to struct task_struct. I
introduced it for the scheduler anno and it was a mistake.

Conversion was mostly scripted, the result was reviewed and all
secondary whitespace and style impact (if any) was fixed up by hand.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:11 -07:00
Ingo Molnar
48f24c4da1 [PATCH] sched: clean up fallout of recent changes
Clean up some of the impact of recent (and not so recent) scheduler
changes:

 - turning macros into nice inline functions
 - sanitizing and unifying variable definitions
 - whitespace, style consistency, 80-lines, comment correctness, spelling
   and curly braces police

Due to the macro hell and variable placement simplifications there's even 26
bytes of .text saved:

   text    data     bss     dec     hex filename
  25510    4153     192   29855    749f sched.o.before
  25484    4153     192   29829    7485 sched.o.after

[akpm@osdl.org: build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:10 -07:00
Paul Mackerras
829035fd70 [PATCH] lockdep: irqtrace subsystem, move account_system_vtime() calls into kernel/softirq.c
At the moment, powerpc and s390 have their own versions of do_softirq which
include local_bh_disable() and __local_bh_enable() calls.  They end up
calling __do_softirq (in kernel/softirq.c) which also does
local_bh_disable/enable.

Apparently the two levels of disable/enable trigger a warning from some
validation code that Ingo is working on, and he would like to see the outer
level removed.  But to do that, we have to move the account_system_vtime
calls that are currently in the arch do_softirq() implementations for
powerpc and s390 into the generic __do_softirq() (this is a no-op for other
archs because account_system_vtime is defined to be an empty inline
function on all other archs).  This patch does that.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:10 -07:00
Ingo Molnar
60be6b9a41 [PATCH] lockdep: annotate on-stack completions
lockdep needs to have the waitqueue lock initialized for on-stack waitqueues
implicitly initialized by DECLARE_COMPLETION().  Annotate on-stack completions
accordingly.

Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:09 -07:00
Ingo Molnar
366c7f554e [PATCH] lockdep: annotate enable_in_hardirq()
Make use of local_irq_enable_in_hardirq() API to annotate places that enable
hardirqs in hardirq context.

Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:09 -07:00
Ingo Molnar
ad33945175 [PATCH] lockdep: annotate ->mmap_sem
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:08 -07:00
Ingo Molnar
5436552448 [PATCH] lockdep: annotate hrtimer base locks
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Ingo Molnar
fcb993712f [PATCH] lockdep: annotate scheduler runqueue locks
Teach per-CPU runqueue locks and recursive locking code to the lock validator.
 Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Ingo Molnar
d730e882a1 [PATCH] lockdep: annotate timer base locks
Split the per-CPU timer base locks up into separate lock classes, because they
are used recursively.

Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Ingo Molnar
eb4542b98c [PATCH] lockdep: annotate waitqueues
Create one lock class for all waitqueue locks in the kernel.  Has no effect on
non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Ingo Molnar
243c7621aa [PATCH] lockdep: annotate genirq
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:06 -07:00
Ingo Molnar
8b8f319fc7 [PATCH] lockdep: annotate futex
Teach special (recursive) locking code to the lock validator.  Introduces
double_lock_hb() to unify double- hash-bucket-lock taking.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:06 -07:00
Ingo Molnar
a0f1ccfd8d [PATCH] lockdep: do not recurse in printk
Make printk()-ing from within the lock validation code safer by using the
lockdep-recursion counter.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:05 -07:00
Ingo Molnar
ef5d4707b9 [PATCH] lockdep: prove mutex locking correctness
Use the lock validator framework to prove mutex locking correctness.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar
8a25d5debf [PATCH] lockdep: prove spinlock rwlock locking correctness
Use the lock validator framework to prove spinlock and rwlock locking
correctness.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar
4ea2176dfa [PATCH] lockdep: prove rwsem locking correctness
Use the lock validator framework to prove rwsem locking correctness.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar
a8f24a3978 [PATCH] lockdep: procfs
Lock validator /proc/lockdep and /proc/lockdep_stats support.
(FIXME: should go into debugfs)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar
6c9076ec9c [PATCH] lockdep: allow read_lock() recursion of same class
From: Ingo Molnar <mingo@elte.hu>

lockdep so far only allowed read-recursion for the same lock instance.
This is enough in the overwhelming majority of cases, but a hostap case
triggered and reported by Miles Lane relies on same-class
different-instance recursion.  So we relax the restriction on read-lock
recursion.

(This change does not allow rwsem read-recursion, which is still
forbidden.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar
fbb9ce9530 [PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.

Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.

What does the lock validator do?  It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems).  Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules.  If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal.  If the
new rule could create a deadlock scenario then this condition is printed out.

When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios.  In a typical system this means millions of separate
scenarios.  This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem).  [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]

Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically.  In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!).  So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself!  In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.

To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class".  For example, all struct inode objects
in the kernel have inode->inotify_mutex.  If there are 10,000 inodes cached,
then there are 10,000 lock objects.  But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class.  The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules.  The
set of rules persist during the lifetime of the kernel.

To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:

 lock-classes:                            694 [max: 2048]
 direct dependencies:                  1598 [max: 8192]
 indirect dependencies:               17896
 all direct dependencies:             16206
 dependency chains:                    1910 [max: 8192]
 in-hardirq chains:                      17
 in-softirq chains:                     105
 in-process chains:                    1065
 stack-trace entries:                 38761 [max: 131072]
 combined max dependencies:         2033928
 hardirq-safe locks:                     24
 hardirq-unsafe locks:                  176
 softirq-safe locks:                     53
 softirq-unsafe locks:                  137
 irq-safe locks:                         59
 irq-unsafe locks:                      176

The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.

More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:

   http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt

[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:03 -07:00
Ingo Molnar
de30a2b355 [PATCH] lockdep: irqtrace subsystem, core
Accurate hard-IRQ-flags and softirq-flags state tracing.

This allows us to attach extra functionality to IRQ flags on/off
events (such as trace-on/off).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:03 -07:00
Ingo Molnar
8637c09901 [PATCH] lockdep: stacktrace subsystem, core
Framework to generate and save stacktraces quickly, without printing anything
to the console.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:02 -07:00
Ingo Molnar
e4d9191885 [PATCH] lockdep: locking init debugging improvement
Locking init improvement:

 - introduce and use __SPIN_LOCK_UNLOCKED for array initializations,
   to pass in the name string of locks, used by debugging

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:02 -07:00
Ingo Molnar
9cebb55268 [PATCH] lockdep: mutex section binutils workaround
Work around weird section nesting build bug causing smp-alternatives failures
under certain circumstances.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar
9a11b49a80 [PATCH] lockdep: better lock debugging
Generic lock debugging:

 - generalized lock debugging framework. For example, a bug in one lock
   subsystem turns off debugging in all lock subsystems.

 - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
   the mutex/rtmutex debugging code: it caused way too much prototype
   hackery, and lockdep will give the same information anyway.

 - ability to do silent tests

 - check lock freeing in vfree too.

 - more finegrained debugging options, to allow distributions to
   turn off more expensive debugging features.

There's no separate 'held mutexes' list anymore - but there's a 'held locks'
stack within lockdep, which unifies deadlock detection across all lock
classes.  (this is independent of the lockdep validation stuff - lockdep first
checks whether we are holding a lock already)

Here are the current debugging options:

CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y

which do:

 config DEBUG_MUTEXES
          bool "Mutex debugging, basic checks"

 config DEBUG_LOCK_ALLOC
         bool "Detect incorrect freeing of live mutexes"

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar
fb7e42413a [PATCH] lockdep: remove mutex deadlock checking code
With the lock validator we detect mutex deadlocks (and more), the mutex
deadlock checking code is both redundant and slower.  So remove it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar
36596243da [PATCH] lockdep: remove DEBUG_BUG_ON()
cleanup: remove unused DEBUG_BUG_ON() defines.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar
9e7f4d451e [PATCH] lockdep: rename DEBUG_WARN_ON()
Rename DEBUG_WARN_ON() to the less generic DEBUG_LOCKS_WARN_ON() name, so that
it's clear that this is a lock-debugging internal mechanism.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar
c4e05116a2 [PATCH] lockdep: clean up rwsems
Clean up rwsems.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar
4d435f9d8f [PATCH] lockdep: add is_module_address()
Add is_module_address() method - to be used by lockdep.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:00 -07:00
Christoph Lameter
9614634fe6 [PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O
It turns out that it is advantageous to leave a small portion of unmapped file
backed pages if all of a zone's pages (or almost all pages) are allocated and
so the page allocator has to go off-node.

This allows recently used file I/O buffers to stay on the node and
reduces the times that zone reclaim is invoked if file I/O occurs
when we run out of memory in a zone.

The problem is that zone reclaim runs too frequently when the page cache is
used for file I/O (read write and therefore unmapped pages!) alone and we have
almost all pages of the zone allocated.  Zone reclaim may remove 32 unmapped
pages.  File I/O will use these pages for the next read/write requests and the
unmapped pages increase.  After the zone has filled up again zone reclaim will
remove it again after only 32 pages.  This cycle is too inefficient and there
are potentially too many zone reclaim cycles.

With the 1% boundary we may still remove all unmapped pages for file I/O in
zone reclaim pass.  However.  it will take a large number of read and writes
to get back to 1% again where we trigger zone reclaim again.

The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
second timeout.

[akpm@osdl.org: rename the /proc file and the variable]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:26:59 -07:00
Benjamin Herrenschmidt
5a43a066b1 [PATCH] genirq: Allow fasteoi handler to retrigger disabled interrupts
Make the fasteoi handler mark disabled interrupts as pending if they
happen anyway. This allow implementation of a delayed disable scheme
with the fasteoi handler.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-07-03 19:54:59 +10:00
Thomas Gleixner
284c66806e [PATCH] genirq:fixup missing SA_PERCPU replacement
The irqflags consolidation converted SA_PERCPU_IRQ to IRQF_PERCPU but
did not define the new constant.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02 17:29:22 -07:00
Thomas Gleixner
d061daa0e3 [PATCH] genirq: ARM dyntick cleanup
Linus: "The hacks in kernel/irq/handle.c are really horrid. REALLY
horrid."

They are indeed. Move the dyntick quirks to ARM where they belong.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02 17:29:21 -07:00
Linus Torvalds
b4b9034132 Merge branch 'genirq' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'genirq' of master.kernel.org:/home/rmk/linux-2.6-arm: (24 commits)
  [ARM] 3683/2:  ARM: Convert at91rm9200 to generic irq handling
  [ARM] 3682/2:  ARM: Convert ixp4xx to generic irq handling
  [ARM] 3702/1: ARM: Convert ixp23xx to generic irq handling
  [ARM] 3701/1: ARM: Convert plat-omap to generic irq handling
  [ARM] 3700/1: ARM: Convert lh7a40x to generic irq handling
  [ARM] 3699/1: ARM: Convert s3c2410 to generic irq handling
  [ARM] 3698/1: ARM: Convert sa1100 to generic irq handling
  [ARM] 3697/1: ARM: Convert shark to generic irq handling
  [ARM] 3696/1: ARM: Convert clps711x to generic irq handling
  [ARM] 3694/1: ARM: Convert ecard driver to generic irq handling
  [ARM] 3693/1: ARM: Convert omap1 to generic irq handling
  [ARM] 3691/1: ARM: Convert imx to generic irq handling
  [ARM] 3688/1: ARM: Convert clps7500 to generic irq handling
  [ARM] 3687/1: ARM: Convert integrator to generic irq handling
  [ARM] 3685/1: ARM: Convert pxa to generic irq handling
  [ARM] 3684/1: ARM: Convert l7200 to generic irq handling
  [ARM] 3681/1: ARM: Convert ixp2000 to generic irq handling
  [ARM] 3680/1: ARM: Convert footbridge to generic irq handling
  [ARM] 3695/1: ARM drivers/pcmcia: Fixup includes
  [ARM] 3689/1: ARM drivers/input/touchscreen: Fixup includes
  ...

Manual conflict resolved in kernel/irq/handle.c (butt-ugly ARM tickless
code).
2006-07-02 15:07:45 -07:00
Thomas Gleixner
3cca53b02a [PATCH] irq-flags: generic irq: Use the new IRQF_ constants
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02 13:58:49 -07:00
Thomas Gleixner
f8b5473fcb [ARM] 3690/1: genirq: Introduce and make use of dummy irq chip
Patch from Thomas Gleixner

From: Thomas Gleixner <tglx@linutronix.de>

ARM has a couple of really dumb interrupt controllers.
Implement a generic one and fixup the ARM migration. ARM reused
the no_irq_chip for this purpose, but this does not work out
for platforms which are not converted to the new interrupt
type handling model.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-07-01 22:30:08 +01:00
Thomas Gleixner
a2166abd06 [ARM] 3679/1: ARM: Make ARM dyntick implementation work with genirq
Patch from Thomas Gleixner

From: Thomas Gleixner <tglx@linutronix.de>

Make the ARM dyntick implementation work with the generic
irq code. This hopefully goes away once we consolidated the
dyntick implementations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-07-01 22:30:07 +01:00
Linus Torvalds
fc25465f09 Merge branch 'audit.b22' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b22' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] audit syscall classes
  [PATCH] audit: support for object context filters
  [PATCH] audit: rename AUDIT_SE_* constants
  [PATCH] add rule filterkey
2006-07-01 09:59:08 -07:00
Bjorn Helgaas
e8c4b9d003 [PATCH] IRQ: warning message cleanup
Make warnings more consistent.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01 09:55:57 -07:00
Bjorn Helgaas
17311c03c3 [PATCH] IRQ: Use SA_PERCPU_IRQ, not IRQ_PER_CPU, for irqaction.flags
IRQ_PER_CPU is a bit in the struct irq_desc "status" field, not in the
struct irqaction "flags", so the previous code checked the wrong bit.

SA_PERCPU_IRQ is only used by drivers/char/mmtimer.c for SGI ia64 boxes.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01 09:55:57 -07:00
Ingo Molnar
ed6f7b10e6 [PATCH] pi-futex: futex_wake() lockup fix
Fix futex_wake() exit condition bug when handling the robust-list with PI
futexes on them.

(reported by Ulrich Drepper, debugged by the lock validator.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01 09:55:57 -07:00
Vernon Mauery
a99e4e413e [PATCH] pi-futex: fix mm_struct memory leak
lock_queue was getting called essentially twice in a row and was
continually incrementing the mm_count ref count, thus causing a memory
leak.

Dinakar Guniguntala provided a proper fix for the problem that simply grabs
the spinlock for the hash bucket queue rather than calling lock_queue.

The second time we do a queue_lock in futex_lock_pi, we really only need to
take the hash bucket lock.

Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Vernon Mauery <vernux@us.ibm.com>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01 09:55:57 -07:00
Al Viro
b915543b46 [PATCH] audit syscall classes
Allow to tie upper bits of syscall bitmap in audit rules to kernel-defined
sets of syscalls.  Infrastructure, a couple of classes (with 32bit counterparts
for biarch targets) and actual tie-in on i386, amd64 and ia64.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01 07:44:10 -04:00
Darrel Goeddel
6e5a2d1d32 [PATCH] audit: support for object context filters
This patch introduces object audit filters based on the elements
of the SELinux context.

Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>
Acked-by:  Stephen Smalley <sds@tycho.nsa.gov>

 kernel/auditfilter.c           |   25 +++++++++++++++++++++++++
 kernel/auditsc.c               |   40 ++++++++++++++++++++++++++++++++++++++++
 security/selinux/ss/services.c |   18 +++++++++++++++++-
 3 files changed, 82 insertions(+), 1 deletion(-)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01 05:44:19 -04:00
Darrel Goeddel
3a6b9f85c6 [PATCH] audit: rename AUDIT_SE_* constants
This patch renames some audit constant definitions and adds
additional definitions used by the following patch.  The renaming
avoids ambiguity with respect to the new definitions.

Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>

 include/linux/audit.h          |   15 ++++++++----
 kernel/auditfilter.c           |   50 ++++++++++++++++++++---------------------
 kernel/auditsc.c               |   10 ++++----
 security/selinux/ss/services.c |   32 +++++++++++++-------------
 4 files changed, 56 insertions(+), 51 deletions(-)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01 05:44:08 -04:00
Amy Griffis
5adc8a6adc [PATCH] add rule filterkey
Add support for a rule key, which can be used to tie audit records to audit
rules.  This is useful when a watched file is accessed through a link or
symlink, as well as for general audit log analysis.

Because this patch uses a string key instead of an integer key, there is a bit
of extra overhead to do the kstrdup() when a rule fires.  However, we're also
allocating memory for the audit record buffer, so it's probably not that
significant.  I went ahead with a string key because it seems more
user-friendly.

Note that the user must ensure that filterkeys are unique.  The kernel only
checks for duplicate rules.

Signed-off-by: Amy Griffis <amy.griffis@hpd.com>
2006-07-01 05:43:06 -04:00
Linus Torvalds
22a3e233ca Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
  Remove obsolete #include <linux/config.h>
  remove obsolete swsusp_encrypt
  arch/arm26/Kconfig typos
  Documentation/IPMI typos
  Kconfig: Typos in net/sched/Kconfig
  v9fs: do not include linux/version.h
  Documentation/DocBook/mtdnand.tmpl: typo fixes
  typo fixes: specfic -> specific
  typo fixes in Documentation/networking/pktgen.txt
  typo fixes: occuring -> occurring
  typo fixes: infomation -> information
  typo fixes: disadvantadge -> disadvantage
  typo fixes: aquire -> acquire
  typo fixes: mecanism -> mechanism
  typo fixes: bandwith -> bandwidth
  fix a typo in the RTC_CLASS help text
  smb is no longer maintained

Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S
2006-06-30 15:39:30 -07:00
Andrew Morton
e7b384043e [PATCH] cond_resched() fix
Fix a bug identified by Zou Nan hai <nanhai.zou@intel.com>:

If the system is in state SYSTEM_BOOTING, and need_resched() is true,
cond_resched() returns true even though it didn't reschedule.  Consequently
need_resched() remains true and JBD locks up.

Fix that by teaching cond_resched() to only return true if it really did call
schedule().

cond_resched_lock() and cond_resched_softirq() have a problem too.  If we're
in SYSTEM_BOOTING state and need_resched() is true, these functions will drop
the lock and will then try to call schedule(), but the SYSTEM_BOOTING state
will prevent schedule() from being called.  So on return, need_resched() will
still be true, but cond_resched_lock() has to return 1 to tell the caller that
the lock was dropped.  The caller will probably lock up.

Bottom line: if these functions dropped the lock, they _must_ call schedule()
to clear need_resched().   Make it so.

Also, uninline __cond_resched().  It's largeish, and slowpath.

Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:38 -07:00
David Quigley
8f95dc58d0 [PATCH] SELinux: add security hook call to kill_proc_info_as_uid
This patch adds a call to the extended security_task_kill hook introduced by
the prior patch to the kill_proc_info_as_uid function so that these signals
can be properly mediated by security modules.  It also updates the existing
hook call in check_kill_permission.

Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:37 -07:00
Christoph Lameter
34aa1330f9 [PATCH] zoned vm counters: zone_reclaim: remove /proc/sys/vm/zone_reclaim_interval
The zone_reclaim_interval was necessary because we were not able to determine
how many unmapped pages exist in a zone.  Therefore we had to scan in
intervals to figure out if any pages were unmapped.

With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
pages and the number of mapped pages in a zone.  So we can simply skip the
reclaim if there is an insufficient number of unmapped pages.  We use
SWAP_CLUSTER_MAX as the boundary.

Drop all support for /proc/sys/vm/zone_reclaim_interval.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Jörn Engel
6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Pavel Machek
e02169b682 remove obsolete swsusp_encrypt
Remove SWSUSP_ENCRYPT config option; it is no longer implemented.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 18:59:59 +02:00
Adrian Bunk
80f7228b59 typo fixes: occuring -> occurring
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 18:27:16 +02:00
Dave Jones
ae90dd5dbe Move workqueue exports to where the functions are defined.
Signed-off-by: Dave Jones <davej@redhat.com>
2006-06-30 01:40:45 -04:00
Venkatesh Pallipadi
7a6bc1cdd5 [CPUFREQ] Add queue_delayed_work_on() interface for workqueues.
Add queue_delayed_work_on() interface for workqueues.

Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
2006-06-30 01:33:31 -04:00
Darrel Goeddel
c7bdb545d2 [NETLINK]: Encapsulate eff_cap usage within security framework.
This patch encapsulates the usage of eff_cap (in netlink_skb_params) within
the security framework by extending security_netlink_recv to include a required
capability parameter and converting all direct usage of eff_caps outside
of the lsm modules to use the interface.  It also updates the SELinux
implementation of the security_netlink_send and security_netlink_recv
hooks to take advantage of the sid in the netlink_skb_params struct.
This also enables SELinux to perform auditing of netlink capability checks.
Please apply, for 2.6.18 if possible.

Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by:  James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29 16:57:55 -07:00
Linus Torvalds
3aa590c6b7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (43 commits)
  [POWERPC] Use little-endian bit from firmware ibm,pa-features property
  [POWERPC] Make sure smp_processor_id works very early in boot
  [POWERPC] U4 DART improvements
  [POWERPC] todc: add support for Time-Of-Day-Clock
  [POWERPC] Make lparcfg.c work when both iseries and pseries are selected
  [POWERPC] Fix idr locking in init_new_context
  [POWERPC] mpc7448hpc2 (taiga) board config file
  [POWERPC] Add tsi108 pci and platform device data register function
  [POWERPC] Add general support for mpc7448hpc2 (Taiga) platform
  [POWERPC] Correct the MAX_CONTEXT definition
  powerpc: minor cleanups for mpc86xx
  [POWERPC] Make sure we select CONFIG_NEW_LEDS if ADB_PMU_LED is set
  [POWERPC] Simplify the code defining the 64-bit CPU features
  [POWERPC] powerpc: kconfig warning fix
  [POWERPC] Consolidate some of kernel/misc*.S
  [POWERPC] Remove unused function call_with_mmu_off
  [POWERPC] update asm-powerpc/time.h
  [POWERPC] Clean up it_lp_queue.h
  [POWERPC] Skip the "copy down" of the kernel if it is already at zero.
  [POWERPC] Add the use of the firmware soft-reset-nmi to kdump.
  ...
2006-06-29 11:32:34 -07:00
Linus Torvalds
1903ac54f8 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6:
  [PATCH] i386: export memory more than 4G through /proc/iomem
  [PATCH] 64bit Resource: finally enable 64bit resource sizes
  [PATCH] 64bit Resource: convert a few remaining drivers to use resource_size_t where needed
  [PATCH] 64bit resource: change pnp core to use resource_size_t
  [PATCH] 64bit resource: change pci core and arch code to use resource_size_t
  [PATCH] 64bit resource: change resource core to use resource_size_t
  [PATCH] 64bit resource: introduce resource_size_t for the start and end of struct resource
  [PATCH] 64bit resource: fix up printks for resources in misc drivers
  [PATCH] 64bit resource: fix up printks for resources in arch and core code
  [PATCH] 64bit resource: fix up printks for resources in pcmcia drivers
  [PATCH] 64bit resource: fix up printks for resources in video drivers
  [PATCH] 64bit resource: fix up printks for resources in ide drivers
  [PATCH] 64bit resource: fix up printks for resources in mtd drivers
  [PATCH] 64bit resource: fix up printks for resources in pci core and hotplug drivers
  [PATCH] 64bit resource: fix up printks for resources in networks drivers
  [PATCH] 64bit resource: fix up printks for resources in sound drivers
  [PATCH] 64bit resource: C99 changes for struct resource declarations

Fixed up trivial conflict in drivers/ide/pci/cmd64x.c (the printk that
was changed by the 64-bit resources had been deleted in the meantime ;)
2006-06-29 10:49:17 -07:00
Ingo Molnar
47c2a3aa44 [PATCH] genirq: add chip->eoi(), fastack -> fasteoi
Clean up the fastack concept by turning it into fasteoi and introducing the
->eoi() method for chips.

This also allows the cleanup of an i386 EOI quirk - now the quirk is
cleanly separated from the pure ACK implementation.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:26 -07:00
Benjamin Herrenschmidt
98bb244b68 [PATCH] genirq: fasteoi handler: handle interrupt disabling
Note when a disable interrupt happened with the fasteoi handler as well so
that delayed disable can be implemented with fasteoi-type controllers.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:25 -07:00
Ingo Molnar
43f7775944 [PATCH] genirq: more verbose debugging on unexpected IRQ vectors
One frequent sign of IRQ handling bugs is the appearance of unexpected
vectors.  Print out all the IRQ state in that case.  We dont want this patch
upstream, but it is useful during initial testing.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:25 -07:00
Ingo Molnar
f1c2662cbc [PATCH] genirq: cleanup: no_irq_type -> no_irq_chip rename
Rename no_irq_type to no_irq_chip.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:25 -07:00
Thomas Gleixner
e76de9f8eb [PATCH] genirq: add SA_TRIGGER support
Enable drivers to request an IRQ with a given irq-flow (trigger/polarity)
setting.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
ba9a2331ba [PATCH] genirq: add irq-wake (power-management) support
Enable platforms to set the irq-wake (power-management) properties of an IRQ.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Ingo Molnar
7a55713ab4 [PATCH] genirq: add handle_bad_irq()
Handle bad IRQ vectors via the irqchip mechanism.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
dd87eb3a24 [PATCH] genirq: add irq-chip support
Enable platforms to use the irq-chip and irq-flow abstractions: allow setting
of the chip, the type and provide highlevel handlers for common irq-flows.

[rostedt@goodmis.org: misroute-irq: Don't call desc->chip->end because of edge interrupts]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
6a6de9ef58 [PATCH] genirq: core
Core genirq support: add the irq-chip and irq-flow abstractions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Ingo Molnar
a34db9b28a [PATCH] genirq: update copyrights
Update/add copyrights in the generic IRQ code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
94d39e1f6e [PATCH] genirq: add IRQ_NOAUTOEN support
Enable platforms to disable the automatic enabling of freshly set up irqs.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
6550c775cb [PATCH] genirq: add IRQ_NOREQUEST support
Enable platforms to disable request_irq() for certain interrupts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
3418d72404 [PATCH] genirq: add IRQ_NOPROBE support
Introduce IRQ_NOPROBE: enables platforms to control chip-probing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
a4633adcdb [PATCH] genirq: add genirq sw IRQ-retrigger
Enable platforms that do not have a hardware-assisted hardirq-resend mechanism
to resend them via a softirq-driven IRQ emulation mechanism.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar
77a5afecdb [PATCH] genirq: cleanup: no_irq_type cleanups
Clean up no_irq_type: share the NOP functions where possible, and properly
name the ack_bad() function.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar
8d28bc751b [PATCH] genirq: doc: handle_IRQ_event() and __do_IRQ() comments
Document handle_IRQ_event() and __do_IRQ().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar
c0ad90a32f [PATCH] genirq: add ->retrigger() irq op to consolidate hw_irq_resend()
Add ->retrigger() irq op to consolidate hw_irq_resend() implementations.
(Most architectures had it defined to NOP anyway.)

NOTE: ia64 needs testing. i386 and x86_64 tested.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Thomas Gleixner
096c8131c5 [PATCH] genirq: debug: better debug printout in enable_irq()
Make enable_irq() debug printouts user-readable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar
0d7012a968 [PATCH] genirq: cleanup: turn ARCH_HAS_IRQ_PER_CPU into CONFIG_IRQ_PER_CPU
Cleanup: change ARCH_HAS_IRQ_PER_CPU into a Kconfig method.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar
cd916d31cc [PATCH] genirq: cleanup: merge pending_irq_cpumask[] into irq_desc[]
Consolidation: remove the pending_irq_cpumask[NR_IRQS] array and move it into
the irq_desc[NR_IRQS].pending_mask field.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
4a733ee126 [PATCH] genirq: cleanup: merge irq_dir[], smp_affinity_entry[] into irq_desc[]
Consolidation: remove the irq_dir[NR_IRQS] and the smp_affinity_entry[NR_IRQS]
arrays and move them into the irq_desc[] array.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
34ffdb7233 [PATCH] genirq: cleanup: reduce irq_desc_t use, mark it obsolete
Cleanup: remove irq_desc_t use from the generic IRQ code, and mark it
obsolete.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
06fcb0c6fb [PATCH] genirq: cleanup: misc code cleanups
Assorted code cleanups to the generic IRQ code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
2e60bbb6d5 [PATCH] genirq: cleanup: remove fastcall
Now that i386 defaults to regparm, explicit uses of fastcall are not needed
anymore.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
a8553acd6c [PATCH] genirq: cleanup: remove irq_descp()
Cleanup: remove irq_descp() - explicit use of irq_desc[] is shorter and more
readable.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
a53da52fd7 [PATCH] genirq: cleanup: merge irq_affinity[] into irq_desc[]
Consolidation: remove the irq_affinity[NR_IRQS] array and move it into the
irq_desc[NR_IRQS].affinity field.

[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
74ffd553a3 [PATCH] genirq: sem2mutex probe_sem -> probing_active
Convert the irq auto-probing semaphore to a mutex.  (This allows us to find
probing API usage bugs sooner, via the mutex debugging code.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:21 -07:00
Ingo Molnar
d1bef4ed5f [PATCH] genirq: rename desc->handler to desc->chip
This patch-queue improves the generic IRQ layer to be truly generic, by adding
various abstractions and features to it, without impacting existing
functionality.

While the queue can be best described as "fix and improve everything in the
generic IRQ layer that we could think of", and thus it consists of many
smaller features and lots of cleanups, the one feature that stands out most is
the new 'irq chip' abstraction.

The irq-chip abstraction is about describing and coding and IRQ controller
driver by mapping its raw hardware capabilities [and quirks, if needed] in a
straightforward way, without having to think about "IRQ flow"
(level/edge/etc.) type of details.

This stands in contrast with the current 'irq-type' model of genirq
architectures, which 'mixes' raw hardware capabilities with 'flow' details.
The patchset supports both types of irq controller designs at once, and
converts i386 and x86_64 to the new irq-chip design.

As a bonus side-effect of the irq-chip approach, chained interrupt controllers
(master/slave PIC constructs, etc.) are now supported by design as well.

The end result of this patchset intends to be simpler architecture-level code
and more consolidation between architectures.

We reused many bits of code and many concepts from Russell King's ARM IRQ
layer, the merging of which was one of the motivations for this patchset.

This patch:

rename desc->handler to desc->chip.

Originally i did not want to do this, because it's a big patch.  But having
both "desc->handler", "desc->handle_irq" and "action->handler" caused a
large degree of confusion and made the code appear alot less clean than it
truly is.

I have also attempted a dual approach as well by introducing a
desc->chip alias - but that just wasnt robust enough and broke
frequently.

So lets get over with this quickly.  The conversion was done automatically
via scripts and converts all the code in the kernel.

This renaming patch is the first one amongst the patches, so that the
remaining patches can stay flexible and can be merged and split up
without having some big monolithic patch act as a merge barrier.

[akpm@osdl.org: build fix]
[akpm@osdl.org: another build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:21 -07:00
Andrew Morton
84860f9979 [PATCH] load_module() cleanup
Undo bizarre declaration in load_module().

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28 14:59:04 -07:00
Arjan van de Ven
f71d20e961 [PATCH] Add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPL
Temporarily add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPL.  These
will be used as a transition measure for symbols that aren't used in the
kernel and are on the way out.  When a module uses such a symbol, a warning
is printk'd at modprobe time.

The main reason for removing unused exports is size: eacho export takes
roughly between 100 and 150 bytes of kernel space in the binary.  This
patch gives users the option to immediately get this size gain via a config
option.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28 14:59:04 -07:00
David Wilder
c0ce7d0886 [POWERPC] Add the use of the firmware soft-reset-nmi to kdump.
With this patch, kdump uses the firmware soft-reset NMI for two purposes:
1) Initiate the kdump (take a crash dump) by issuing a soft-reset.
2) Break a CPU out of a deadlock condition that is detected during kdump
processing.

When a soft-reset is initiated each CPU will enter
system_reset_exception() and set its corresponding bit in the global
bit-array cpus_in_sr then call die(). When die() finds the CPU's bit set
in cpu_in_sr crash_kexec() is called to initiate a crash dump. The first
CPU to enter crash_kexec() is called the "crashing CPU". All other CPUs
are "secondary CPUs". The secondary CPU's pass through to
crash_kexec_secondary() and sleep. The crashing CPU waits for all CPUs
to enter via soft-reset then boots the kdump kernel (see
crash_soft_reset_check())

When the system crashes due to a panic or exception, crash_kexec() is
called by panic() or die(). The crashing CPU sends an IPI to all other
CPUs to notify them of the pending shutdown. If a CPU is in a deadlock
or hung state with interrupts disabled, the IPI will not be delivered.
The result being, that the kdump kernel is not booted. This problem is
solved with the use of a firmware generated soft-reset. After the
crashing_cpu has issued the IPI, it waits for 10 sec for all CPUs to
enter crash_ipi_callback(). A CPU signifies its entry to
crash_ipi_callback() by setting its corresponding bit in the
cpus_in_crash bit array. After 10 sec, if one or more CPUs have not set
their bit in cpus_in_crash we assume that the CPU(s) is deadlocked. The
operator is then prompted to generate a soft-reset to break the
deadlock. Each CPU enters the soft reset handler as described above.

Two conditions must be handled at this point:
1) The system crashed because the operator generated a soft-reset. See
2) The system had crashed before the soft-reset was generated ( in the
case of a Panic or oops).

The first CPU to enter crash_kexec() uses the state of the kexec_lock to
determine this state. If kexec_lock is already held then condition 2 is
true and crash_kexec_secondary() is called, else; this CPU is flagged as
the crashing CPU, the kexec_lock is acquired and crash_kexec() proceeds
as described above.

Each additional CPUs responding to the soft-reset will pass through
crash_kexec() to kexec_secondary(). All secondary CPUs call
crash_ipi_callback() readying them self's for the shutdown. When ready
they clear their bit in cpus_in_sr. The crashing CPU waits in
kexec_secondary() until all other CPUs have cleared their bits in
cpus_in_sr. The kexec kernel boot is then started.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-06-28 15:18:52 +10:00
Jesper Juhl
9a66a53f55 [PATCH] Remove redundant NULL checks before [kv]free - in kernel/
Remove redundant kfree NULL checks from kernel/

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:48 -07:00
Sebastien Dugue
59e0e0ace7 [PATCH] futex_requeue() optimization
In futex_requeue(), when the 2 futexes keys hash to the same bucket, there
is no need to move the futex_q to the end of the bucket list.

Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:48 -07:00
Thomas Gleixner
95e02ca9bb [PATCH] rtmutex: Propagate priority settings into PI lock chains
When the priority of a task, which is blocked on a lock, changes we must
propagate this change into the PI lock chain.  Therefor the chain walk code
is changed to get rid of the references to current to avoid false positives
in the deadlock detector, as setscheduler might be called by a task which
holds the lock on which the task whose priority is changed is blocked.

Also add some comments about the get/put_task_struct usage to avoid
confusion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:48 -07:00
Thomas Gleixner
0bafd214e4 [PATCH] rtmutex: Modify rtmutex-tester to test the setscheduler propagation
Make test suite setscheduler calls asynchronously.  Remove the waits in the
test cases and add a new testcase to verify the correctness of the
setscheduler priority propagation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Thomas Gleixner
e74c69f46d [PATCH] Drop tasklist lock in do_sched_setscheduler
There is no need to hold tasklist_lock across the setscheduler call, when
we pin the task structure with get_task_struct().  Interrupts are disabled
in setscheduler anyway and the permission checks do not need interrupts
disabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar
c87e2837be [PATCH] pi-futex: futex_lock_pi/futex_unlock_pi support
This adds the actual pi-futex implementation, based on rt-mutexes.

[dino@in.ibm.com: fix an oops-causing race]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar
0cdbee9920 [PATCH] pi-futex: rt mutex futex api
Add proxy-locking rt-mutex functionality needed by pi-futexes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Thomas Gleixner
61a8712286 [PATCH] pi-futex: rt mutex tester
RT-mutex tester: scriptable tester for rt mutexes, which allows userspace
scripting of mutex unit-tests (and dynamic tests as well), using the actual
rt-mutex implementation of the kernel.

[akpm@osdl.org: fixlet]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar
e7eebaf6a8 [PATCH] pi-futex: rt mutex debug
Runtime debugging functionality for rt-mutexes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar
23f78d4a03 [PATCH] pi-futex: rt mutex core
Core functions for the rt-mutex subsystem.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar
b29739f902 [PATCH] pi-futex: scheduler support for pi
Add framework to boost/unboost the priority of RT tasks.

This consists of:

 - caching the 'normal' priority in ->normal_prio
 - providing a functions to set/get the priority of the task
 - make sched_setscheduler() aware of boosting

The effective_prio() cleanups also fix a priority-calculation bug pointed out
by Andrey Gelman, in set_user_nice().

has_rt_policy() fix: Peter Williams <pwil3058@bigpond.net.au>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andrey Gelman <agelman@012.net.il>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Ingo Molnar
e2970f2fb6 [PATCH] pi-futex: futex code cleanups
We are pleased to announce "lightweight userspace priority inheritance" (PI)
support for futexes.  The following patchset and glibc patch implements it,
ontop of the robust-futexes patchset which is included in 2.6.16-mm1.

We are calling it lightweight for 3 reasons:

 - in the user-space fastpath a PI-enabled futex involves no kernel work
   (or any other PI complexity) at all.  No registration, no extra kernel
   calls - just pure fast atomic ops in userspace.

 - in the slowpath (in the lock-contention case), the system call and
   scheduling pattern is in fact better than that of normal futexes, due to
   the 'integrated' nature of FUTEX_LOCK_PI.  [more about that further down]

 - the in-kernel PI implementation is streamlined around the mutex
   abstraction, with strict rules that keep the implementation relatively
   simple: only a single owner may own a lock (i.e.  no read-write lock
   support), only the owner may unlock a lock, no recursive locking, etc.

  Priority Inheritance - why, oh why???
  -------------------------------------

Many of you heard the horror stories about the evil PI code circling Linux for
years, which makes no real sense at all and is only used by buggy applications
and which has horrible overhead.  Some of you have dreaded this very moment,
when someone actually submits working PI code ;-)

So why would we like to see PI support for futexes?

We'd like to see it done purely for technological reasons.  We dont think it's
a buggy concept, we think it's useful functionality to offer to applications,
which functionality cannot be achieved in other ways.  We also think it's the
right thing to do, and we think we've got the right arguments and the right
numbers to prove that.  We also believe that we can address all the
counter-arguments as well.  For these reasons (and the reasons outlined below)
we are submitting this patch-set for upstream kernel inclusion.

What are the benefits of PI?

  The short reply:
  ----------------

User-space PI helps achieving/improving determinism for user-space
applications.  In the best-case, it can help achieve determinism and
well-bound latencies.  Even in the worst-case, PI will improve the statistical
distribution of locking related application delays.

  The longer reply:
  -----------------

Firstly, sharing locks between multiple tasks is a common programming
technique that often cannot be replaced with lockless algorithms.  As we can
see it in the kernel [which is a quite complex program in itself], lockless
structures are rather the exception than the norm - the current ratio of
lockless vs.  locky code for shared data structures is somewhere between 1:10
and 1:100.  Lockless is hard, and the complexity of lockless algorithms often
endangers to ability to do robust reviews of said code.  I.e.  critical RT
apps often choose lock structures to protect critical data structures, instead
of lockless algorithms.  Furthermore, there are cases (like shared hardware,
or other resource limits) where lockless access is mathematically impossible.

Media players (such as Jack) are an example of reasonable application design
with multiple tasks (with multiple priority levels) sharing short-held locks:
for example, a highprio audio playback thread is combined with medium-prio
construct-audio-data threads and low-prio display-colory-stuff threads.  Add
video and decoding to the mix and we've got even more priority levels.

So once we accept that synchronization objects (locks) are an unavoidable fact
of life, and once we accept that multi-task userspace apps have a very fair
expectation of being able to use locks, we've got to think about how to offer
the option of a deterministic locking implementation to user-space.

Most of the technical counter-arguments against doing priority inheritance
only apply to kernel-space locks.  But user-space locks are different, there
we cannot disable interrupts or make the task non-preemptible in a critical
section, so the 'use spinlocks' argument does not apply (user-space spinlocks
have the same priority inversion problems as other user-space locking
constructs).  Fact is, pretty much the only technique that currently enables
good determinism for userspace locks (such as futex-based pthread mutexes) is
priority inheritance:

Currently (without PI), if a high-prio and a low-prio task shares a lock [this
is a quite common scenario for most non-trivial RT applications], even if all
critical sections are coded carefully to be deterministic (i.e.  all critical
sections are short in duration and only execute a limited number of
instructions), the kernel cannot guarantee any deterministic execution of the
high-prio task: any medium-priority task could preempt the low-prio task while
it holds the shared lock and executes the critical section, and could delay it
indefinitely.

  Implementation:
  ---------------

As mentioned before, the userspace fastpath of PI-enabled pthread mutexes
involves no kernel work at all - they behave quite similarly to normal
futex-based locks: a 0 value means unlocked, and a value==TID means locked.
(This is the same method as used by list-based robust futexes.) Userspace uses
atomic ops to lock/unlock these mutexes without entering the kernel.

To handle the slowpath, we have added two new futex ops:

  FUTEX_LOCK_PI
  FUTEX_UNLOCK_PI

If the lock-acquire fastpath fails, [i.e.  an atomic transition from 0 to TID
fails], then FUTEX_LOCK_PI is called.  The kernel does all the remaining work:
if there is no futex-queue attached to the futex address yet then the code
looks up the task that owns the futex [it has put its own TID into the futex
value], and attaches a 'PI state' structure to the futex-queue.  The pi_state
includes an rt-mutex, which is a PI-aware, kernel-based synchronization
object.  The 'other' task is made the owner of the rt-mutex, and the
FUTEX_WAITERS bit is atomically set in the futex value.  Then this task tries
to lock the rt-mutex, on which it blocks.  Once it returns, it has the mutex
acquired, and it sets the futex value to its own TID and returns.  Userspace
has no other work to perform - it now owns the lock, and futex value contains
FUTEX_WAITERS|TID.

If the unlock side fastpath succeeds, [i.e.  userspace manages to do a TID ->
0 atomic transition of the futex value], then no kernel work is triggered.

If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), then
FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the behalf of
userspace - and it also unlocks the attached pi_state->rt_mutex and thus wakes
up any potential waiters.

Note that under this approach, contrary to other PI-futex approaches, there is
no prior 'registration' of a PI-futex.  [which is not quite possible anyway,
due to existing ABI properties of pthread mutexes.]

Also, under this scheme, 'robustness' and 'PI' are two orthogonal properties
of futexes, and all four combinations are possible: futex, robust-futex,
PI-futex, robust+PI-futex.

  glibc support:
  --------------

Ulrich Drepper and Jakub Jelinek have written glibc support for PI-futexes
(and robust futexes), enabling robust and PI (PTHREAD_PRIO_INHERIT) POSIX
mutexes.  (PTHREAD_PRIO_PROTECT support will be added later on too, no
additional kernel changes are needed for that).  [NOTE: The glibc patch is
obviously inofficial and unsupported without matching upstream kernel
functionality.]

the patch-queue and the glibc patch can also be downloaded from:

  http://redhat.com/~mingo/PI-futex-patches/

Many thanks go to the people who helped us create this kernel feature: Steven
Rostedt, Esben Nielsen, Benedikt Spranger, Daniel Walker, John Cooper, Arjan
van de Ven, Oleg Nesterov and others.  Credits for related prior projects goes
to Dirk Grambow, Inaky Perez-Gonzalez, Bill Huey and many others.

Clean up the futex code, before adding more features to it:

 - use u32 as the futex field type - that's the ABI
 - use __user and pointers to u32 instead of unsigned long
 - code style / comment style cleanups
 - rename hash-bucket name from 'bh' to 'hb'.

I checked the pre and post futex.o object files to make sure this
patch has no code effects.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Steven Rostedt
66e5393a78 [PATCH] BUG() if setscheduler is called from interrupt context
Thomas Gleixner is adding the call to a rtmutex function in setscheduler.
This call grabs a spin_lock that is not always protected by interrupts
disabled.  So this means that setscheduler cant be called from interrupt
context.

To prevent this from happening in the future, this patch adds a
BUG_ON(in_interrupt()) in that function.  (Thanks to akpm <aka.  Andrew
Morton> for this suggestion).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Oleg Nesterov
9fea80e4d9 [PATCH] sched: uninline task_rq_lock()
Saves 543 bytes from sched.o (gcc 3.3.3).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Siddha, Suresh B
5c45bf279d [PATCH] sched: mc/smt power savings sched policy
sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in
/sys/devices/system/cpu/ control the MC/SMT power savings policy for the
scheduler.

Based on the values (1-enable, 0-disable) for these controls, sched groups
cpu power will be determined for different domains.  When power savings
policy is enabled and under light load conditions, scheduler will minimize
the physical packages/cpu cores carrying the load and thus conserving
power(with a perf impact based on the workload characteristics...  see OLS
2005 CMP kernel scheduler paper for more details..)

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri
369381694d [PATCH] sched_domai: Allocate sched_group structures dynamically
As explained here:
	http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2

there is a problem with sharing sched_group structures between two
separate sched_group structures for different sched_domains.

The patch has been tested and found to avoid the kernel lockup problem
described in above URL.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri
15f0b676a4 [PATCH] sched_domai: Use kmalloc_node
The sched group structures used to represent various nodes need to be
allocated from respective nodes (as suggested here also:

	http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0603.3/0051.html)

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri
d3a5aa9858 [PATCH] sched_domai: Don't use GFP_ATOMIC
Replace GFP_ATOMIC allocation for sched_group_nodes with GFP_KERNEL based
allocation.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri
51888ca25a [PATCH] sched_domain: handle kmalloc failure
Try to handle mem allocation failures in build_sched_domains by bailing out
and cleaning up thus-far allocated memory.  The patch has a direct consequence
that we disable load balancing completely (even at sibling level) upon *any*
memory allocation failure.

[Lee.Schermerhorn@hp.com: bugfix]
Signed-off-by: Srivatsa Vaddagir <vatsa@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Peter Williams
615052dc3b [PATCH] sched: Avoid unnecessarily moving highest priority task move_tasks()
Problem:

To help distribute high priority tasks evenly across the available CPUs
move_tasks() does not, under some circumstances, skip tasks whose load
weight is bigger than the designated amount.  Because the highest priority
task on the busiest queue may be on the expired array it may be moved as a
result of this mechanism.  Apart from not being the most desirable way to
redistribute the high priority tasks (we'd rather move the second highest
priority task), there is a risk that this could set up a loop with this
task bouncing backwards and forwards between the two queues.  (This latter
possibility can be demonstrated by running a nice==-20 CPU bound task on an
otherwise quiet 2 CPU system.)

Solution:

Modify the mechanism so that it does not override skip for the highest
priority task on the CPU.  Of course, if there are more than one tasks at
the highest priority then it will allow the override for one of them as
this is a desirable redistribution of high priority tasks.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Peter Williams
50ddd96917 [PATCH] sched: modify move_tasks() to improve load balancing outcomes
Problem:

The move_tasks() function is designed to move UP TO the amount of load it
is asked to move and in doing this it skips over tasks looking for ones
whose load weights are less than or equal to the remaining load to be
moved.  This is (in general) a good thing but it has the unfortunate result
of breaking one of the original load balancer's good points: namely, that
(within the limits imposed by the active/expired array model and the fact
the expired is processed first) it moves high priority tasks before low
priority ones and this means there's a good chance (see active/expired
problem for why it's only a chance) that the highest priority task on the
queue but not actually on the CPU will be moved to the other CPU where (as
a high priority task) it may preempt the current task.

Solution:

Modify move_tasks() so that high priority tasks are not skipped when moving
them will make them the highest priority task on their new run queue.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Peter Williams
2dd73a4f09 [PATCH] sched: implement smpnice
Problem:

The introduction of separate run queues per CPU has brought with it "nice"
enforcement problems that are best described by a simple example.

For the sake of argument suppose that on a single CPU machine with a
nice==19 hard spinner and a nice==0 hard spinner running that the nice==0
task gets 95% of the CPU and the nice==19 task gets 5% of the CPU.  Now
suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and
2 nice==0 hard spinners running.  The user of this system would be entitled
to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
tasks only get 5% each.  However, whether this expectation is met is pretty
much down to luck as there are four equally likely distributions of the
tasks to the CPUs that the load balancing code will consider to be balanced
with loads of 2.0 for each CPU.  Two of these distributions involve one
nice==0 and one nice==19 task per CPU and in these circumstances the users
expectations will be met.  The other two distributions both involve both
nice==0 tasks being on one CPU and both nice==19 being on the other CPU and
each task will get 50% of a CPU and the user's expectations will not be
met.

Solution:

The solution to this problem that is implemented in the attached patch is
to use weighted loads when determining if the system is balanced and, when
an imbalance is detected, to move an amount of weighted load between run
queues (as opposed to a number of tasks) to restore the balance.  Once
again, the easiest way to explain why both of these measures are necessary
is to use a simple example.  Suppose that (in a slight variation of the
above example) that we have a two CPU system with 4 nice==0 and 4 nice=19
hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and
the 4 nice==19 tasks are on the other CPU.  The weighted loads for the two
CPUs would be 4.0 and 0.2 respectively and the load balancing code would
move 2 tasks resulting in one CPU with a load of 2.0 and the other with
load of 2.2.  If this was considered to be a big enough imbalance to
justify moving a task and that task was moved using the current
move_tasks() then it would move the highest priority task that it found and
this would result in one CPU with a load of 3.0 and the other with a load
of 1.2 which would result in the movement of a task in the opposite
direction and so on -- infinite loop.  If, on the other hand, an amount of
load to be moved is calculated from the imbalance (in this case 0.1) and
move_tasks() skips tasks until it find ones whose contributions to the
weighted load are less than this amount it would move two of the nice==19
tasks resulting in a system with 2 nice==0 and 2 nice=19 on each CPU with
loads of 2.1 for each CPU.

One of the advantages of this mechanism is that on a system where all tasks
have nice==0 the load balancing calculations would be mathematically
identical to the current load balancing code.

Notes:

struct task_struct:

has a new field load_weight which (in a trade off of space for speed)
stores the contribution that this task makes to a CPU's weighted load when
it is runnable.

struct runqueue:

has a new field raw_weighted_load which is the sum of the load_weight
values for the currently runnable tasks on this run queue.  This field
always needs to be updated when nr_running is updated so two new inline
functions inc_nr_running() and dec_nr_running() have been created to make
sure that this happens.  This also offers a convenient way to optimize away
this part of the smpnice mechanism when CONFIG_SMP is not defined.

int try_to_wake_up():

in this function the value SCHED_LOAD_BALANCE is used to represent the load
contribution of a single task in various calculations in the code that
decides which CPU to put the waking task on.  While this would be a valid
on a system where the nice values for the runnable tasks were distributed
evenly around zero it will lead to anomalous load balancing if the
distribution is skewed in either direction.  To overcome this problem
SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
or by the average load_weight per task for the queue in question (as
appropriate).

int move_tasks():

The modifications to this function were complicated by the fact that
active_load_balance() uses it to move exactly one task without checking
whether an imbalance actually exists.  This precluded the simple
overloading of max_nr_move with max_load_move and necessitated the addition
of the latter as an extra argument to the function.  The internal
implementation is then modified to move up to max_nr_move tasks and
max_load_move of weighted load.  This slightly complicates the code where
move_tasks() is called and if ever active_load_balance() is changed to not
use move_tasks() the implementation of move_tasks() should be simplified
accordingly.

struct sched_group *find_busiest_group():

Similar to try_to_wake_up(), there are places in this function where
SCHED_LOAD_SCALE is used to represent the load contribution of a single
task and the same issues are created.  A similar solution is adopted except
that it is now the average per task contribution to a group's load (as
opposed to a run queue) that is required.  As this value is not directly
available from the group it is calculated on the fly as the queues in the
groups are visited when determining the busiest group.

A key change to this function is that it is no longer to scale down
*imbalance on exit as move_tasks() uses the load in its scaled form.

void set_user_nice():

has been modified to update the task's load_weight field when it's nice
value and also to ensure that its run queue's raw_weighted_load field is
updated if it was runnable.

From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>

With smpnice, sched groups with highest priority tasks can mask the imbalance
between the other sched groups with in the same domain.  This patch fixes some
of the listed down scenarios by not considering the sched groups which are
lightly loaded.

a) on a simple 4-way MP system, if we have one high priority and 4 normal
   priority tasks, with smpnice we would like to see the high priority task
   scheduled on one cpu, two other cpus getting one normal task each and the
   fourth cpu getting the remaining two normal tasks.  but with current
   smpnice extra normal priority task keeps jumping from one cpu to another
   cpu having the normal priority task.  This is because of the
   busiest_has_loaded_cpus, nr_loaded_cpus logic..  We are not including the
   cpu with high priority task in max_load calculations but including that in
   total and avg_load calcuations..  leading to max_load < avg_load and load
   balance between cpus running normal priority tasks(2 Vs 1) will always show
   imbalanace as one normal priority and the extra normal priority task will
   keep moving from one cpu to another cpu having normal priority task..

b) 4-way system with HT (8 logical processors).  Package-P0 T0 has a
   highest priority task, T1 is idle.  Package-P1 Both T0 and T1 have 1 normal
   priority task each..  P2 and P3 are idle.  With this patch, one of the
   normal priority tasks on P1 will be moved to P2 or P3..

c) With the current weighted smp nice calculations, it doesn't always make
   sense to look at the highest weighted runqueue in the busy group..
   Consider a load balance scenario on a DP with HT system, with Package-0
   containing one high priority and one low priority, Package-1 containing one
   low priority(with other thread being idle)..  Package-1 thinks that it need
   to take the low priority thread from Package-0.  And find_busiest_queue()
   returns the cpu thread with highest priority task..  And ultimately(with
   help of active load balance) we move high priority task to Package-1.  And
   same continues with Package-0 now, moving high priority task from package-1
   to package-0..  Even without the presence of active load balance, load
   balance will fail to balance the above scenario..  Fix find_busiest_queue
   to use "imbalance" when it is lightly loaded.

[kernel@kolivas.org: sched: store weighted load on up]
[kernel@kolivas.org: sched: add discrete weighted cpu load function]
[suresh.b.siddha@intel.com: sched: remove dead code]
Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Kirill Korotaev
efc30814a8 [PATCH] sched: CPU hotplug race vs. set_cpus_allowed()
There is a race between set_cpus_allowed() and move_task_off_dead_cpu().
__migrate_task() doesn't report any err code, so task can be left on its
runqueue if its cpus_allowed mask changed so that dest_cpu is not longer a
possible target.  Also, chaning cpus_allowed mask requires rq->lock being
held.

Signed-off-by: Kirill Korotaev <dev@openvz.org>
Acked-By: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Steven Rostedt
cc94abfcbc [PATCH] unnecessary long index i in sched
Unless we expect to have more than 2G CPUs, there's no reason to have 'i'
as a long long here.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Con Kolivas
72d2854d4e [PATCH] sched: fix interactive ceiling code
The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
and not explicit enough.  The sleep boost is not supposed to be any larger
than without this code and the comment is not clear enough about what
exactly it does, just the reason it does it.  Fix it.

There is a ceiling to the priority beyond which tasks that only ever sleep
for very long periods cannot surpass.  Fix it.

Prevent the on-runqueue bonus logic from defeating the idle sleep logic.

Opportunity to micro-optimise.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Steven Rostedt
d444886e14 [PATCH] sched: simplify bitmap definition
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Chen, Kenneth W
c96d145e71 [PATCH] sched: fix smt nice lock contention and optimization
Initial report and lock contention fix from Chris Mason:

Recent benchmarks showed some performance regressions between 2.6.16 and
2.6.5.  We tracked down one of the regressions to lock contention in
schedule heavy workloads (~70,000 context switches per second)

kernel/sched.c:dependent_sleeper() was responsible for most of the lock
contention, hammering on the run queue locks.  The patch below is more of a
discussion point than a suggested fix (although it does reduce lock
contention significantly).  The dependent_sleeper code looks very expensive
to me, especially for using a spinlock to bounce control between two
different siblings in the same cpu.

It is further optimized:

* perform dependent_sleeper check after next task is determined
* convert wake_sleeping_dependent to use trylock
* skip smt runqueue check if trylock fails
* optimize double_rq_lock now that smt nice is converted to trylock
* early exit in searching first SD_SHARE_CPUPOWER domain
* speedup fast path of dependent_sleeper

[akpm@osdl.org: cleanup]
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Chris Mason <mason@suse.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Chandra Seetharaman
26c2143b63 [PATCH] cpu hotplug: make cpu_notifier related notifier calls __cpuinit only
Make notifier_calls associated with cpu_notifier as __cpuinit.

__cpuinit makes sure that the function is init time only unless
CONFIG_HOTPLUG_CPU is defined.

[akpm@osdl.org: section fix]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman
65edc68c34 [PATCH] cpu hotplug: make [un]register_cpu_notifier init time only
CPUs come online only at init time (unless CONFIG_HOTPLUG_CPU is defined).
So, cpu_notifier functionality need to be available only at init time.

This patch makes register_cpu_notifier() available only at init time, unless
CONFIG_HOTPLUG_CPU is defined.

This patch exports register_cpu_notifier() and unregister_cpu_notifier() only
if CONFIG_HOTPLUG_CPU is defined.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman
054cc8a2d8 [PATCH] cpu hotplug: revert initdata patch submitted for 2.6.17
This patch reverts notifier_block changes made in 2.6.17

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman
9c7b216d23 [PATCH] cpu hotplug: revert init patch submitted for 2.6.17
In 2.6.17, there was a problem with cpu_notifiers and XFS.  I provided a
band-aid solution to solve that problem.  In the process, i undid all the
changes you both were making to ensure that these notifiers were available
only at init time (unless CONFIG_HOTPLUG_CPU is defined).

We deferred the real fix to 2.6.18.  Here is a set of patches that fixes the
XFS problem cleanly and makes the cpu notifiers available only at init time
(unless CONFIG_HOTPLUG_CPU is defined).

If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
time.

This patch reverts the notifier_call changes made in 2.6.17

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Paul E. McKenney
c32e066057 [PATCH] rcutorture: add call_rcu_bh() operations
Add operations for the call_rcu_bh() variant of RCU.  Also add an
rcu_batches_completed_bh() function, which is needed by rcutorture.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Paul E. McKenney
72e9bb5492 [PATCH] rcutorture: add ops vector and Classic RCU ops
Add an ops vector to rcutorture, and add the ops for Classic RCU.  Update
the rcutorture documentation to reflect slight change to the dmesg formats.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Paul E. McKenney
29766f1eb3 [PATCH] rcutorture: catchup doc fixes for idle-hz tests
This just catches the RCU torture documentation up with the recent fixes
that test RCU for architectures that turn of the scheduling-clock interrupt
for idle CPUs and the addition of a SUCCESS/FAILURE indication, fixing up
an obsolete comment as well.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Randy Dunlap
1dbe83c344 [PATCH] fix kernel-doc in kernel/ dir
Fix kernel-doc parameters in kernel/

Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1376): No description found for parameter 'u_abs_timeout'
Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1420): No description found for parameter 'u_msg_prio'
Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1420): No description found for parameter 'u_abs_timeout'
Warning(/var/linsrc/linux-2617-g9//kernel/acct.c:526): No description found for parameter 'pacct'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:39 -07:00
Ingo Molnar
34af946a22 [PATCH] spin/rwlock init cleanups
locking init cleanups:

 - convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK()
 - convert rwlocks in a similar manner

this patch was generated automatically.

Motivation:

 - cleanliness
 - lockdep needs control of lock initialization, which the open-coded
   variants do not give
 - it's also useful for -rt and for lock debugging in general

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:39 -07:00
Randy Dunlap
a7807a32bb [PATCH] poison: add & use more constants
Add more poison values to include/linux/poison.h.  It's not clear to me
whether some others should be added or not, so I haven't added any of
these:

./include/linux/libata.h:#define ATA_TAG_POISON		0xfafbfcfdU
./arch/ppc/8260_io/fcc_enet.c:1918:	memset((char *)(&(immap->im_dprambase[(mem_addr+64)])), 0x88, 32);
./drivers/usb/mon/mon_text.c:429:	memset(mem, 0xe5, sizeof(struct mon_event_text));
./drivers/char/ftape/lowlevel/ftape-ctl.c:738:		memset(ft_buffer[i]->address, 0xAA, FT_BUFF_SIZE);
./drivers/block/sx8.c:/* 0xf is just arbitrary, non-zero noise; this is sorta like poisoning */

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:38 -07:00
Ingo Molnar
e6e5494cb2 [PATCH] vdso: randomize the i386 vDSO by moving it into a vma
Move the i386 VDSO down into a vma and thus randomize it.

Besides the security implications, this feature also helps debuggers, which
can COW a vma-backed VDSO just like a normal DSO and can thus do
single-stepping and other debugging features.

It's good for hypervisors (Xen, VMWare) too, which typically live in the same
high-mapped address space as the VDSO, hence whenever the VDSO is used, they
get lots of guest pagefaults and have to fix such guest accesses up - which
slows things down instead of speeding things up (the primary purpose of the
VDSO).

There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support
for older glibcs that still rely on a prelinked high-mapped VDSO.  Newer
distributions (using glibc 2.3.3 or later) can turn this option off.  Turning
it off is also recommended for security reasons: attackers cannot use the
predictable high-mapped VDSO page as syscall trampoline anymore.

There is a new vdso=[0|1] boot option as well, and a runtime
/proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned
on/off.

(This version of the VDSO-randomization patch also has working ELF
coredumping, the previous patch crashed in the coredumping code.)

This code is a combined work of the exec-shield VDSO randomization
code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell
started this patch and i completed it.

[akpm@osdl.org: cleanups]
[akpm@osdl.org: compile fix]
[akpm@osdl.org: compile fix 2]
[akpm@osdl.org: compile fix 3]
[akpm@osdl.org: revernt MAXMEM change]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Cc: Gerd Hoffmann <kraxel@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:38 -07:00
KAMEZAWA Hiroyuki
2842f11419 [PATCH] catch valid mem range at onlining memory
This patch allows hot-add memory which is not aligned to section.

Now, hot-added memory has to be aligned to section size.  Considering big
section sized archs, this is not useful.

When hot-added memory is registerd as iomem resoruce by iomem resource
patch, we can make use of that information to detect valid memory range.

Note: With this, not-aligned memory can be registerd. To allow hot-add
      memory with holes, we have to do more work around add_memory().
      (It doesn't allows add memory to already existing mem section.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:36 -07:00
Andrew Morton
5c31f2738a [PATCH] pm_trace is dangerous
CONFIG_PM_TRACES scrogs your RTC.  Mark it as experimental, and defaulting to
`off'.

Also beef up the help message a bit.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:35 -07:00
Randy Dunlap
7f32a25f63 [PATCH] kernel/acct: fix function definition
kernel/acct.c:579:19: warning: non-ANSI function declaration of function 'acct_process'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:35 -07:00
Greg Kroah-Hartman
6550e07f41 [PATCH] 64bit Resource: finally enable 64bit resource sizes
Introduce the Kconfig entry and actually switch to a 64bit value, if
wanted, for resource_size_t.

Based on a patch series originally from Vivek Goyal <vgoyal@in.ibm.com>

Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-27 09:24:00 -07:00
Greg Kroah-Hartman
d75fc8bbcc [PATCH] 64bit resource: change resource core to use resource_size_t
Based on a patch series originally from Vivek Goyal <vgoyal@in.ibm.com>

Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-27 09:23:59 -07:00
Greg Kroah-Hartman
685143ac1f [PATCH] 64bit resource: fix up printks for resources in arch and core code
This is needed if we wish to change the size of the resource structures.

Based on an original patch from Vivek Goyal <vgoyal@in.ibm.com> and
Andrew Morton.

(tweaked by Andy Isaacson <adi@hexapodia.org>)

Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Andy Isaacson <adi@hexapodia.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-27 09:23:59 -07:00
Linus Torvalds
2a2ed2db35 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild: (40 commits)
  kbuild: trivial fixes in Makefile
  kbuild: adding symbols in Kconfig and defconfig to TAGS
  kbuild: replace abort() with exit(1)
  kbuild: support for %.symtypes files
  kbuild: fix silentoldconfig recursion
  kbuild: add option for stripping modules while installing them
  kbuild: kill some false positives from modpost
  kbuild: export-symbol usage report generator
  kbuild: fix make -rR breakage
  kbuild: append -dirty for updated but uncommited changes
  kbuild: append git revision for all untagged commits
  kbuild: fix module.symvers parsing in modpost
  kbuild: ignore make's built-in rules & variables
  kbuild: bugfix with initramfs
  kbuild: modpost build fix
  kbuild: check license compatibility when building modules
  kbuild: export-type enhancement to modpost.c
  kbuild: add dependency on kernel.release to the package targets
  kbuild: `make kernelrelease' speedup
  kconfig: KCONFIG_OVERWRITECONFIG
  ...
2006-06-26 11:05:15 -07:00
Linus Torvalds
81a07d7588 Merge branch 'x86-64'
* x86-64: (83 commits)
  [PATCH] x86_64: x86_64 stack usage debugging
  [PATCH] x86_64: (resend) x86_64 stack overflow debugging
  [PATCH] x86_64: msi_apic.c build fix
  [PATCH] x86_64: i386/x86-64 Add nmi watchdog support for new Intel CPUs
  [PATCH] x86_64: Avoid broadcasting NMI IPIs
  [PATCH] x86_64: fix apic error on bootup
  [PATCH] x86_64: enlarge window for stack growth
  [PATCH] x86_64: Minor string functions optimizations
  [PATCH] x86_64: Move export symbols to their C functions
  [PATCH] x86_64: Standardize i386/x86_64 handling of NMI_VECTOR
  [PATCH] x86_64: Fix modular pc speaker
  [PATCH] x86_64: remove sys32_ni_syscall()
  [PATCH] x86_64: Do not use -ffunction-sections for modules
  [PATCH] x86_64: Add cpu_relax to apic_wait_icr_idle
  [PATCH] x86_64: adjust kstack_depth_to_print default
  [PATCH] i386/x86-64: adjust /proc/interrupts column headings
  [PATCH] x86_64: Fix race in cpu_local_* on preemptible kernels
  [PATCH] x86_64: Fix fast check in safe_smp_processor_id
  [PATCH] x86_64: x86_64 setup.c - printing cmp related boottime information
  [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
  ...

Manual resolve of trivial conflict in arch/i386/kernel/Makefile
2006-06-26 10:51:09 -07:00
Andi Kleen
495ab9c045 [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
During some profiling I noticed that default_idle causes a lot of
memory traffic. I think that is caused by the atomic operations
to clear/set the polling flag in thread_info. There is actually
no reason to make this atomic - only the idle thread does it
to itself, other CPUs only read it. So I moved it into ti->status.

Converted i386/x86-64/ia64 for now because that was the easiest
way to fix ACPI which also manipulates these flags in its idle
function.

Cc: Nick Piggin <npiggin@novell.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:21 -07:00
Jan Beulich
83f4fcce7f [PATCH] x86_64: allow unwinder to build without module support
Add proper conditionals to be able to build with CONFIG_MODULES=n.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:18 -07:00
Jan Beulich
c33bd9aac0 [PATCH] i386/x86-64: fall back to old-style call trace if no unwinding
If no unwinding is possible at all for a certain exception instance,
fall back to the old style call trace instead of not showing any trace
at all.

Also, allow setting the stack trace mode at the command line.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:18 -07:00
Jan Beulich
4552d5dc08 [PATCH] x86_64: reliable stack trace support
These are the generic bits needed to enable reliable stack traces based
on Dwarf2-like (.eh_frame) unwind information. Subsequent patches will
enable x86-64 and i386 to make use of this.

Thanks to Andi Kleen and Ingo Molnar, who pointed out several possibilities
for improvement.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:17 -07:00
Andi Kleen
bebfa1013e [PATCH] x86_64: Add compat_printk and sysctl to turn off compat layer warnings
Sometimes e.g. with crashme the compat layer warnings can be noisy.
Add a way to turn them off by gating all output through compat_printk
that checks a global sysctl. The default is not changed.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:16 -07:00
Peter Williams
b78709cfd4 [PATCH] sched: fix SCHED_FIFO bug in sys_sched_rr_get_interval()
The introduction of SCHED_BATCH scheduling class with a value of 3 means
that the expression (p->policy & SCHED_FIFO) will return true if policy
is SCHED_BATCH or SCHED_FIFO.

Unfortunately, this expression is used in sys_sched_rr_get_interval()
and in the absence of a comment to say that this is intentional I
presume that it is unintentional and erroneous.

The fix is to change the expression to (p->policy == SCHED_FIFO).

Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:02:41 -07:00
Oleg Nesterov
cf2dfbfbf4 [PATCH] coredump: copy_process: don't check SIGNAL_GROUP_EXIT
After the previous patch SIGNAL_GROUP_EXIT implies a pending SIGKILL, we
can remove this check from copy_process() because we already checked
!signal_pending().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:27 -07:00
Oleg Nesterov
d5f70c00ad [PATCH] coredump: kill ptrace related stuff
With this patch zap_process() sets SIGNAL_GROUP_EXIT while sending SIGKILL to
the thread group.  This means that a TASK_TRACED task

	1. Will be awakened by signal_wake_up(1)

	2. Can't sleep again via ptrace_notify()

	3. Can't go to do_signal_stop() after return
	   from ptrace_stop() in get_signal_to_deliver()

So we can remove all ptrace related stuff from coredump path.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:27 -07:00
Eric W. Biederman
df26c40e56 [PATCH] proc: Cleanup proc_fd_access_allowed
In process of getting proc_fd_access_allowed to work it has developed a few
warts.  In particular the special case that always allows introspection and
the special case to allow inspection of kernel threads.

The special case for introspection is needed for /proc/self/mem.

The special case for kernel threads really should be overridable
by security modules.

So consolidate these checks into ptrace.c:may_attach().

The check to always allow introspection is trivial.

The check to allow access to kernel threads, and zombies is a little
trickier.  mem_read and mem_write already verify an mm exists so it isn't
needed twice.  proc_fd_access_allowed only doesn't want a check to verify
task->mm exits, s it prevents all access to kernel threads.  So just move
the task->mm check into ptrace_attach where it is needed for practical
reasons.

I did a quick audit and none of the security modules in the kernel seem to
care if they are passed a task without an mm into security_ptrace.  So the
above move should be safe and it allows security modules to come up with
more restrictive policy.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Eric W. Biederman
13b41b0949 [PATCH] proc: Use struct pid not struct task_ref
Incrementally update my proc-dont-lock-task_structs-indefinitely patches so
that they work with struct pid instead of struct task_ref.

Mostly this is a straight 1-1 substitution.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Eric W. Biederman
99f8955183 [PATCH] proc: don't lock task_structs indefinitely
Every inode in /proc holds a reference to a struct task_struct.  If a
directory or file is opened and remains open after the the task exits this
pinning continues.  With 8K stacks on a 32bit machine the amount pinned per
file descriptor is about 10K.

Normally I would figure a reasonable per user process limit is about 100
processes.  With 80 processes, with a 1000 file descriptors each I can trigger
the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless
data.

This patch replaces the struct task_struct pointer with a pointer to a struct
task_ref which has a struct task_struct pointer.  The so the pinning of dead
tasks does not happen.

The code now has to contend with the fact that the task may now exit at any
time.  Which is a little but not muh more complicated.

With this change it takes about 1000 processes each opening up 1000 file
descriptors before I can trigger the OOM killer.  Much better.

[mlp@google.com: task_mmu small fixes]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Paul Jackson <pj@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Albert Cahalan <acahalan@gmail.com>
Signed-off-by: Prasanna Meda <mlp@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:25 -07:00
Eric W. Biederman
48e6484d49 [PATCH] proc: Rewrite the proc dentry flush on exit optimization
To keep the dcache from filling up with dead /proc entries we flush them on
process exit.  However over the years that code has gotten hairy with a
dentry_pointer and a lock in task_struct and misdocumented as a correctness
feature.

I have rewritten this code to look and see if we have a corresponding entry in
the dcache and if so flush it on process exit.  This removes the extra fields
in the task_struct and allows me to trivially handle the case of a
/proc/<tgid>/task/<pid> entry as well as the current /proc/<pid> entries.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:24 -07:00
Anil S Keshavamurthy
e6f47f978b [PATCH] Notify page fault call chain
With this patch Kprobes now registers for page fault notifications only when
their is an active probe registered.  Once all the active probes are
unregistered their is no need to be notified of page faults and kprobes
unregisters itself from the page fault notifications.  Hence we will have ZERO
side effects when no probes are active.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:22 -07:00
Anil S Keshavamurthy
3d5631e063 [PATCH] Kprobes registers for notify page fault
Kprobes now registers for page fault notifications.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavmurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:22 -07:00
mao, bibo
3672165677 [PATCH] Kprobe: multi kprobe posthandler for booster
If there are multi kprobes on the same probepoint, there will be one extra
aggr_kprobe on the head of kprobe list.  The aggr_kprobe has
aggr_post_handler/aggr_break_handler whether the other kprobe
post_hander/break_handler is NULL or not.  This patch modifies this, only
when there is one or more kprobe in the list whose post_handler is not
NULL, post_handler of aggr_kprobe will be set as aggr_post_handler.

[soshima@redhat.com: !CONFIG_PREEMPT fix]
Signed-off-by: bibo, mao <bibo.mao@intel.com>
Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Yumiko Sugita <sugita@sdl.hitachi.co.jp>
Cc: Hideo Aoki <haoki@redhat.com>
Signed-off-by: Satoshi Oshima <soshima@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:22 -07:00
Roman Zippel
19923c190e [PATCH] fix and optimize clock source update
This fixes the clock source updates in update_wall_time() to correctly
track the time coming in via current_tick_length().  Optimize the fast
paths to be as short as possible to keep the overhead low.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:21 -07:00
john stultz
a275254975 [PATCH] time: rename clocksource functions
As suggested by Roman Zippel, change clocksource functions to use
clocksource_xyz rather then xyz_clocksource to avoid polluting the
namespace.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:21 -07:00
john stultz
5d0cf410e9 [PATCH] Time: i386 Clocksource Drivers
Implement the time sources for i386 (acpi_pm, cyclone, hpet, pit, and tsc).
With this patch, the conversion of the i386 arch to the generic timekeeping
code should be complete.

The patch should be fairly straight forward, only adding the new clocksources.

[hirofumi@mail.parknet.co.jp: acpi_pm cleanup]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:21 -07:00
john stultz
cf3c769b4b [PATCH] Time: Introduce arch generic time accessors
Introduces clocksource switching code and the arch generic time accessor
functions that use the clocksource infrastructure.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz
5eb6d20533 [PATCH] Time: Use clocksource abstraction for NTP adjustments
Instead of incrementing xtime by tick_nsec + ntp adjustments, use the
clocksource abstraction to increment and scale time.  Using the clocksource
abstraction allows other clocksources to be used consistently in the face of
late or lost ticks, while preserving the existing behavior via the jiffies
clocksource.

This removes the need to keep time_phase adjustments as we just use the
current_tick_length() function as the NTP interface and accumulate time using
shifted nanoseconds.

The basics of this design was by Roman Zippel, however it is my own
interpretation and implementation, so the credit should go to him and the
blame to me.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz
260a42309b [PATCH] Time: Let user request precision from current_tick_length()
Change the current_tick_length() function so it takes an argument which
specifies how much precision to return in shifted nanoseconds.  This provides
a simple way to convert between NTPs internal nanoseconds shifted by
(SHIFT_SCALE - 10) to other shifted nanosecond units that are used by the
clocksource abstraction.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz
ad596171ed [PATCH] Time: Use clocksource infrastructure for update_wall_time
Modify the update_wall_time function so it increments time using the
clocksource abstraction instead of jiffies.  Since the only clocksource driver
currently provided is the jiffies clocksource, this should result in no
functional change.  Additionally, a timekeeping_init and timekeeping_resume
function has been added to initialize and maintain some of the new timekeping
state.

[hirofumi@mail.parknet.co.jp: fixlet]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz
734efb467b [PATCH] Time: Clocksource Infrastructure
This introduces the clocksource management infrastructure.  A clocksource is a
driver-like architecture generic abstraction of a free-running counter.  This
code defines the clocksource structure, and provides management code for
registering, selecting, accessing and scaling clocksources.

Additionally, this includes the trivial jiffies clocksource, a lowest common
denominator clocksource, provided mainly for use as an example.

[hirofumi@mail.parknet.co.jp: Don't enable IRQ too early]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
Ingo Molnar
81615b624a [PATCH] Convert kernel/cpu.c to mutexes
Convert kernel/cpu.c from semaphore to mutex.

I've reviewed all lock_cpu_hotplug() critical sections, and they all seem to
fit mutex semantics.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:16 -07:00
Ingo Molnar
1fb00c6cbd [PATCH] work around ppc64 bootup bug by making mutex-debugging save/restore irqs
It seems ppc64 wants to lock mutexes in early bootup code, with interrupts
disabled, and they expect interrupts to stay disabled, else they crash.

Work around this bug by making mutex debugging variants save/restore irq
flags.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:16 -07:00
Linus Torvalds
3448097fcc Revert "swsusp special saveable pages support" commits
This reverts commits

  3e3318dee0 [PATCH] swsusp: x86_64 mark special saveable/unsaveable pages
  b6370d96e0 [PATCH] swsusp: i386 mark special saveable/unsaveable pages
  ce4ab0012b [PATCH] swsusp: add architecture special saveable pages support

because not only do they apparently cause page faults on x86, the
infrastructure doesn't compile on powerpc.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 18:41:00 -07:00
Linus Torvalds
72cf2709bf Fix PM_TRACE dependency: works only on 32-bit x86 for now
Not that x86-64 and other architecture support should be difficult to
add (trivial fixups to the data format and add the proper linker script
entry).

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:04:15 -07:00
KaiGai Kohei
77787bfb44 [PATCH] pacct: none-delayed process accounting accumulation
In current 2.6.17 implementation, signal_struct refered from task_struct is
used for per-process data structure.  The pacct facility also uses it as a
per-process data structure to store stime, utime, minflt, majflt.  But those
members are saved in __exit_signal().  It's too late.

For example, if some threads exits at same time, pacct facility has a
possibility to drop accountings for a part of those threads.  (see, the
following 'The results of original 2.6.17 kernel') I think accounting
information should be completely collected into the per-process data structure
before writing out an accounting record.

This patch fixes this matter.  Accumulation of stime, utime, minflt and majflt
are done before generating accounting record.

[mingo@elte.hu: fix acct_collect() siglock bug found by lockdep]
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:25 -07:00
KaiGai Kohei
f6ec29a42d [PATCH] pacct: avoidance to refer the last thread as a representation of the process
When pacct facility generate an 'ac_flag' field in accounting record, it
refers a task_struct of the thread which died last in the process.  But any
other task_structs are ignored.

Therefore, pacct facility drops ASU flag even if root-privilege operations are
used by any other threads except the last one.  In addition, AFORK flag is
always set when the thread of group-leader didn't die last, although this
process has called execve() after fork().

We have a same matter in ac_exitcode.  The recorded ac_exitcode is an exit
code of the last thread in the process.  There is a possibility this exitcode
is not the group leader's one.
2006-06-25 10:01:25 -07:00
KaiGai Kohei
0e4648141a [PATCH] pacct: add pacct_struct to fix some pacct bugs.
The pacct facility need an i/o operation when an accounting record is
generated.  There is a possibility to wake OOM killer up.  If OOM killer is
activated, it kills some processes to make them release process memory
regions.

But acct_process() is called in the killed processes context before calling
exit_mm(), so those processes cannot release own memory.  In the results, any
processes stop in this point and it finally cause a system stall.
2006-06-25 10:01:25 -07:00
Randy Dunlap
9e37bd301e [PATCH] kthread: move kernel-doc and put it into DocBook
Move kthread API kernel-doc from kthread.h to kthread.c & fix it.
Add kthread API to kernel-api DocBook.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:24 -07:00
Randy Dunlap
fa9799e33d [PATCH] ktime/hrtimer: fix kernel-doc comments
Fix kernel-doc formatting in ktime.h and hrtimer.[ch] files.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:23 -07:00
Heiko Carstens
fc75cdfa5b [PATCH] cpu hotplug: fix CPU_UP_CANCEL handling
If a cpu hotplug callback fails on CPU_UP_PREPARE, all callbacks will be
called with CPU_UP_CANCELED.  A few of these callbacks assume that on
CPU_UP_PREPARE a pointer to task has been stored in a percpu array.  This
assumption is not true if CPU_UP_PREPARE fails and the following calls to
kthread_bind() in CPU_UP_CANCELED will cause an addressing exception
because of passing a NULL pointer.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:22 -07:00
Serge E. Hallyn
8bdd1d1250 [PATCH] kthread: convert stop_machine into a kthread
- Update stop_machine.c to spawn stop_machine as kthreads rather than the
  deprecated kernel_threads.

- Update stop_machine to use the more efficient kthread_bind() before
  running task in place of set_cpus_allowed() after.

[akpm@osdl.org: remove now-wrong set_cpus_allowed()]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:22 -07:00
Anton Blanchard
2aa92581fb [PATCH] Link error when futexes are disabled on 64bit architectures
If futexes are disabled we fail to link on ppc64.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:17 -07:00
akpm@osdl.org
838cd153a5 [PATCH] N32 sigset and __COMPAT_ENDIAN_SWAP__
I'm testing glibc on MIPS64, little-endian, N32, O32 and N64 multilibs.

Among the NPTL test failures seen are some arising from sigsuspend problems
for N32: it blocks the wrong signals, so SIGCANCEL (SIGRTMIN) is blocked
despite glibc's carefully excluding it from sets of signals to block.
Specifically, testing suggests it blocks signal N^32 instead of signal N,
so (in the example tested) blocking SIGUSR1 (17) blocks signal 49 instead.

glibc's sigset_t uses an array of unsigned long, as does the kernel.
In both cases, signal N+1 is represented as
(1UL << (N % (8 * sizeof (unsigned long)))) in word number
(N / (8 * sizeof (unsigned long))).

Thus the N32 glibc uses an array of 32-bit words and the N64 kernel uses an
array of 64-bit words.  For little-endian, the layout is the same, with
signals 1-32 in the first 4 bytes, signals 33-64 in the second, etc.; for
big-endian, userspace has that layout while in the kernel each 8 bytes have
the two halves swapped from the userspace layout.

The N32 sigsuspend syscall uses sigset_from_compat to convert the userspace
sigset to kernel format.  If __COMPAT_ENDIAN_SWAP__ is *not* set, this uses
logic of the form

  set->sig[0] = compat->sig[0] | (((long)compat->sig[1]) << 32 )

to convert the userspace sigset to a kernel one.  This looks correct to me
for both big and little endian, given that in userspace compat->sig[1] will
represent signals 33-64, and so will the high 32 bits of set->sig[0] in the
kernel.  If however __COMPAT_ENDIAN_SWAP__ *is* set, as it is for
__MIPSEL__, it uses

  set->sig[0] = compat->sig[1] | (((long)compat->sig[0]) << 32 );

which seems incorrect for both big and little endian, and would
explain the observed symptoms.

This code is the only use of __COMPAT_ENDIAN_SWAP__, so if incorrect
then that macro serves no purpose, in which case something like the
following patch would seem appropriate to remove it.

Signed-off-by: Joseph Myers <joseph@codesourcery.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:15 -07:00
Stephen Hemminger
eab03ac7bd [PATCH] Get rid of /proc/sys/proc
The table is empty, why does it still exist?

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:15 -07:00
Jan Engelhardt
3b9c04106b [PATCH] printk time parameter
Currently, enabling/disabling printk timestamps is only possible through
reboot (bootparam) or recompile.  I normally do not run with timestamps
(since syslog handles that in a good manner), but for measuring small
kernel delays (e.g.  irq probing - see parport thread) I needed subsecond
precision, but then again, just for some minutes rather than all kernel
messages to come.  The following patch adds a module_param() with which the
timestamps can be en-/disabled in a live system through
/sys/modules/printk/parameters/printk_time.

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:13 -07:00
Matt Helsley
11e64757f9 [PATCH] Remove unecessary NULL check in kernel/acct.c
copy_process() appears to be the only caller of acct_clear_integrals() and
does not pass in NULL task pointers.  Remove the unecessary check.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:09 -07:00
Andreas Mohr
3b364b8d58 [PATCH] constify parts of kernel/power/
Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:08 -07:00
Andrew Morton
b61367732f [PATCH] schedule_on_each_cpu(): reduce kmalloc() size
schedule_on_each_cpu() presently does a large kmalloc - 96 kbytes on 1024 CPU
64-bit.

Rework it so that we do one 8192-byte allocation and then a pile of tiny ones,
via alloc_percpu().  This has a much higher chance of success (100% in the
current VM).

This also has the effect of reducing the memory requirements from NR_CPUS*n to
num_possible_cpus()*n.

Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:07 -07:00
Adrian Bunk
83cc5ed3c4 [PATCH] kernel/sys.c: cleanups
- proper prototypes for the following functions:
  - ctrl_alt_del()  (in include/linux/reboot.h)
  - getrusage()     (in include/linux/resource.h)
- make the following needlessly global functions static:
  - kernel_restart_prepare()
  - kernel_kexec()

[akpm@osdl.org: compile fix]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:06 -07:00
Michael Ellerman
76a8ad2939 [PATCH] Make printk work for really early debugging
Currently printk is no use for early debugging because it refuses to
actually print anything to the console unless
cpu_online(smp_processor_id()) is true.

The stated explanation is that console drivers may require per-cpu
resources, or otherwise barf, because the system is not yet setup
correctly.  Fair enough.

However some console drivers might be quite happy running early during
boot, in fact we have one, and so it'd be nice if printk understood that.

So I added a flag (which I would have called CON_BOOT, but that's taken)
called CON_ANYTIME, which indicates that a console is happy to be called
anytime, even if the cpu is not yet online.

Tested on a Power 5 machine, with both a CON_ANYTIME driver and a bogus
console driver that BUG()s if called while offline.  No problems AFAICT.
Built for i386 UP & SMP.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:05 -07:00
Alan Stern
bbb1747d4e [PATCH] Allow raw_notifier callouts to unregister themselves
Since raw_notifier chains don't benefit from any centralized locking
protections, they shouldn't suffer from the associated limitations.  Under
some circumstances it might make sense for a raw_notifier callout routine
to unregister itself from the notifier chain.  This patch (as678) changes
the notifier core to allow for such things.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:01 -07:00
Paul Mackerras
bfe5d83419 [PATCH] Define __raw_get_cpu_var and use it
There are several instances of per_cpu(foo, raw_smp_processor_id()), which
is semantically equivalent to __get_cpu_var(foo) but without the warning
that smp_processor_id() can give if CONFIG_DEBUG_PREEMPT is enabled.  For
those architectures with optimized per-cpu implementations, namely ia64,
powerpc, s390, sparc64 and x86_64, per_cpu() turns into more and slower
code than __get_cpu_var(), so it would be preferable to use __get_cpu_var
on those platforms.

This defines a __raw_get_cpu_var(x) macro which turns into per_cpu(x,
raw_smp_processor_id()) on architectures that use the generic per-cpu
implementation, and turns into __get_cpu_var(x) on the architectures that
have an optimized per-cpu implementation.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:01 -07:00
Jesper Juhl
f867d2a2e5 [PATCH] ensure NULL deref can't possibly happen in is_exported()
If CONFIG_KALLSYMS is defined and if it should happen that is_exported() is
given a NULL 'mod' and lookup_symbol(name, __start___ksymtab,
__stop___ksymtab) returns 0, then we'll end up dereferencing a NULL
pointer.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:00:59 -07:00
Linus Torvalds
eb71c87a49 Add some basic resume trace facilities
Considering that there isn't a lot of hw we can depend on during resume,
this is about as good as it gets.

This is x86-only for now, although the basic concept (and most of the
code) will certainly work on almost any platform.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-24 14:44:01 -07:00
Eric Sesterhenn
125e18745f [PATCH] More BUG_ON conversion
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Acked-by: "Salyzyn, Mark" <mark_salyzyn@adaptec.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
Jan Beulich
908dcecda1 [PATCH] adjust handle_IRR_event() return type
Correct the return type of handle_IRQ_event() (inconsistency noticed during
Xen development), and remove redundant declarations.  The return type
adjustment required breaking out the definition of irqreturn_t into a
separate header, in order to satisfy current include order dependencies.

Signed-off-by: Jan Beulich <jbeulich@novell.com>

Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
Porpoise
3439dd86e3 [PATCH] When CONFIG_BASE_SMALL=1, cascade() may enter an infinite loop
When CONFIG_BASE_SAMLL=1, cascade() in may enter the infinite loop.
Because of CONFIG_BASE_SMALL=1(TVR_BITS=6 and TVN_BITS=4), the list
base->tv5 may cascade into base->tv5.  So, the kernel enters the infinite
loop in the function cascade().

I created a test module to verify this bug, and a patch to fix it.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/timer.h>
#if 0
#include <linux/kdb.h>
#else
#define kdb_printf printk
#endif

#define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
#define TVN_SIZE (1 << TVN_BITS)
#define TVR_SIZE (1 << TVR_BITS)
#define TVN_MASK (TVN_SIZE - 1)
#define TVR_MASK (TVR_SIZE - 1)

#define TV_SIZE(N)  (N*TVN_BITS  + TVR_BITS)

struct timer_list timer0;
struct timer_list dummy_timer1;
struct timer_list dummy_timer2;

void dummy_timer_fun(unsigned long data) {
}
unsigned long j=0;
void check_timer_base(unsigned long data)
{
        kdb_printf("check_timer_base %08x\n",jiffies);
        mod_timer(&timer0,(jiffies & (~0xFFF)) + 0x1FFF);
}

int init_module(void)
{
        init_timer(&timer0);
        timer0.data = (unsigned long)0;
        timer0.function = check_timer_base;
        mod_timer(&timer0,jiffies+1);

        init_timer(&dummy_timer1);
        dummy_timer1.data = (unsigned long)0;
        dummy_timer1.function = dummy_timer_fun;

        init_timer(&dummy_timer2);
        dummy_timer2.data = (unsigned long)0;
        dummy_timer2.function = dummy_timer_fun;

        j=jiffies;
        j&=(~((1<<TV_SIZE(3))-1));
        j+=(1<<TV_SIZE(3));
        j+=(1<<TV_SIZE(4));

        kdb_printf("mod_timer %08x\n",j);

        mod_timer(&dummy_timer1, j );
        mod_timer(&dummy_timer2, j );

        return 0;
}

void cleanup_module()
{
        del_timer_sync(&timer0);
        del_timer_sync(&dummy_timer1);
        del_timer_sync(&dummy_timer2);
}

(Cleanups from Oleg)

[oleg@tv-sign.ru: use list_replace_init()]
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
Oleg Nesterov
626ab0e69d [PATCH] list: use list_replace_init() instead of list_splice_init()
list_splice_init(list, head) does unneeded job if it is known that
list_empty(head) == 1.  We can use list_replace_init() instead.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Randy Dunlap
862f5f0133 [PATCH] Doc: add audit & acct to DocBook
Fix one audit kernel-doc description (one parameter was missing).
Add audit*.c interfaces to DocBook.
Add BSD accounting interfaces to DocBook.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Paul E. McKenney
d83015b8f6 [PATCH] Make RCU API inaccessible to non-GPL Linux kernel modules
Remove synchronize_kernel() (deprecated 2-APR-2005 in
http://lkml.org/lkml/2005/4/3/11) and makes the RCU API inaccessible to
non-GPL Linux kernel modules (as was announced more than one year ago in
http://lkml.org/lkml/2005/4/3/8).  Tested on x86 and ppc64.

Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Jes Sorensen
55f4e8d156 [PATCH] kernel/sys.c doesn't need init.h
kernel/sys.c doesn't have anything in it relying on linux/init.h -
remove the include.

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Andrew Morton
57ae250861 [PATCH] CONFIG_NET=n build fix
Cc: Greg KH <greg@kroah.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:06 -07:00
Andreas Mohr
83d4e6e7fb [PATCH] make noirqdebug/irqfixup __read_mostly, add (un)likely()
Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:05 -07:00
Daniel Walker
89d0cf01c0 [PATCH] invert irq/migration.c brach prediction
If you get to that point in the code it means that desc->move_irq is set,
pending_irq_cpumask[irq] and cpu_online_map should have a value.  Still
pretty good chance anding those two you'll still have a value.  So these
two branch predictors should be inverted.

Signed-off-by: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:04 -07:00
Ingo Molnar
8e0a43d8fa [PATCH] cond_resched() might_sleep() fix
add the __might_sleep() check back to cond_resched().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:04 -07:00
Prasanna Meda
6e66726047 [PATCH] dup fd error fix
Set errorp in dup_fd, it will be used in sys_unshare also.

Signed-off-by: Prasanna Meda <mlp@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:04 -07:00
Andrew Morton
0ae26f1b31 [PATCH] mmput() might sleep
exit_aio() and exit_mmap() can sleep.  But it's easy to accidentally call
mmput() from inside locks.

Cc: Dave Peterson <dsp@llnl.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:03 -07:00
Jeff Moyer
c330dda908 [PATCH] Add a sysfs file to determine if a kexec kernel is loaded
Create two files in /sys/kernel, kexec_loaded and kexec_crash_loaded.  Each
file contains a simple boolean value indicating whether the relevant kernel
has been loaded into memory.  The motivation for this is geared around
support.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:02 -07:00
Rafael J. Wysocki
968808b895 [PATCH] swsusp: use less memory during resume
Make swsusp allocate only as much memory as needed to store the image data
and metadata during resume.

Without this patch swsusp additionally allocates many page frames that will
conflict with the "original" locations of the image data and are considered
as "unsafe", treating them as "eaten" pages (ie.  allocated but unusable).

The patch makes swsusp allocate as many pages as it'll need to store the
data read from the image in one shot, creating a list of allocated "safe"
pages, and use the observation that all pages allocated by it are marked
with the PG_nosave and PG_nosave_free flags set.   Namely, when it's about
to load an image page, swsusp can check whether the page frame
corresponding to the "original" location of this page has been allocated
(ie.  if the page frame has the PG_nosave and PG_nosave_free flags set) and
if so, it can load the page directly into this page frame.   Otherwise it
uses an allocated "safe" page from the list to store the data that will be
copied to their "original" location later on.

This allows us to save many page copyings and page allocations during
resume and in the future it may allow us to load images greater than 50% of
the normal zone.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: "Pavel Machek" <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:00 -07:00
Adrian Bunk
7bff24e255 [PATCH] kernel/power/snapshot.c: cleanups
- make needlessly global functions static
- make dummy functions static inline

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:59 -07:00
Rafael J. Wysocki
a938c356d5 [PATCH] swsusp: take lowmem reserves into account
swsusp allocates memory from the normal zone, so it cannot use lowmem
reserve pages from the lower zones.  Therefore it should not count these
pages as available to it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:59 -07:00
Shaohua Li
ce4ab0012b [PATCH] swsusp: add architecture special saveable pages support
1. Add architecture specific pages save/restore support.  Next two patches
   will use this to save/restore 'ACPI NVS' pages.

2. Allow reserved pages 'nosave'.  This could avoid save/restore BIOS
   reserved pages.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:59 -07:00
Zhang Yanmin
1b61b910e9 [PATCH] x86: kernel irq balance doesn't work
On i386, kernel irq balance doesn't work.

1) In function do_irq_balance, after kernel finds the min_loaded cpu but
   before calling set_pending_irq to really pin the selected_irq to the
   target cpu, kernel does a cpus_and with irq_affinity[selected_irq].
   Later on, when the irq is acked, kernel would calls
   move_native_irq=>desc->handler->set_affinity to change the irq affinity.
    However, every function pointed by
   hw_interrupt_type->set_affinity(unsigned int irq, cpumask_t cpumask)
   always changes irq_affinity[irq] to cpumask.  Next time when recalling
   do_irq_balance, it has to do cpu_ands again with
   irq_affinity[selected_irq], but irq_affinity[selected_irq] already
   becomes one cpu selected by the first irq balance.

2) Function balance_irq in file arch/i386/kernel/io_apic.c has the same
   issue.

[akpm@osdl.org: cleanups]
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:57 -07:00
David Quigley
22fb52dd73 [PATCH] SELinux: add security hook call to mediate attach_task (kernel/cpuset.c)
Add a security hook call to enable security modules to control the ability
to attach a task to a cpuset.  While limited control over this operation is
possible via permission checks on the pseudo fs interface, those checks are
not sufficient to control access to the target task, which is looked up in
this function.  The existing task_setscheduler hook is re-used for this
operation since this falls under the same class of operations.

Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Acked-by:  Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:54 -07:00
David Quigley
e7834f8fcc [PATCH] SELinux: add security hooks to {get,set}affinity
This patch adds LSM hooks into the setaffinity and getaffinity functions to
enable security modules to control these operations between tasks with
task_setscheduler and task_getscheduler LSM hooks.

Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Acked-by:  Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Christoph Lameter
9216dfad4f [PATCH] move_pages: fix 32 -> 64 bit compat function
The definition of the third parameter is a pointer to an array of virtual
addresses which give us some trouble.  The existing code calculated the
wrong address in the array since I used void to avoid having to specify a
type.

I now use the correct type "compat_uptr_t __user *" in the definition of
the function in kernel/compat.c.

However, I used __u32 in syscalls.h.  Would have to include compat.h there
in order to provide the same definition which would generate an ugly
include situation.

On both ia64 and x86_64 compat_uptr_t is u32. So this works although
parameter declarations differ.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Christoph Lameter
1b2db9fb7a [PATCH] sys_move_pages: 32bit support (i386, x86_64)
sys_move_pages() support for 32bit (i386 plus x86_64 compat layer)

Add support for move_pages() on i386 and also add the compat functions
necessary to run 32 bit binaries on x86_64.

Add compat_sys_move_pages to the x86_64 32bit binary layer.  Note that it is
not up to date so I added the missing pieces.  Not sure if this is done the
right way.

[akpm@osdl.org: compile fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Christoph Lameter
742755a1d8 [PATCH] page migration: sys_move_pages(): support moving of individual pages
move_pages() is used to move individual pages of a process. The function can
be used to determine the location of pages and to move them onto the desired
node. move_pages() returns status information for each page.

long move_pages(pid, number_of_pages_to_move,
		addresses_of_pages[],
		nodes[] or NULL,
		status[],
		flags);

The addresses of pages is an array of void * pointing to the
pages to be moved.

The nodes array contains the node numbers that the pages should be moved
to. If a NULL is passed instead of an array then no pages are moved but
the status array is updated. The status request may be used to determine
the page state before issuing another move_pages() to move pages.

The status array will contain the state of all individual page migration
attempts when the function terminates. The status array is only valid if
move_pages() completed successfullly.

Possible page states in status[]:

0..MAX_NUMNODES	The page is now on the indicated node.

-ENOENT		Page is not present

-EACCES		Page is mapped by multiple processes and can only
		be moved if MPOL_MF_MOVE_ALL is specified.

-EPERM		The page has been mlocked by a process/driver and
		cannot be moved.

-EBUSY		Page is busy and cannot be moved. Try again later.

-EFAULT		Invalid address (no VMA or zero page).

-ENOMEM		Unable to allocate memory on target node.

-EIO		Unable to write back page. The page must be written
		back in order to move it since the page is dirty and the
		filesystem does not provide a migration function that
		would allow the moving of dirty pages.

-EINVAL		A dirty page cannot be moved. The filesystem does not provide
		a migration function and has no ability to write back pages.

The flags parameter indicates what types of pages to move:

MPOL_MF_MOVE	Move pages that are only mapped by the process.

MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
		Requires sufficient capabilities.

Possible return codes from move_pages()

-ENOENT		No pages found that would require moving. All pages
		are either already on the target node, not present, had an
		invalid address or could not be moved because they were
		mapped by multiple processes.

-EINVAL		Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
		to migrate pages in a kernel thread.

-EPERM		MPOL_MF_MOVE_ALL specified without sufficient priviledges.
		or an attempt to move a process belonging to another user.

-EACCES		One of the target nodes is not allowed by the current cpuset.

-ENODEV		One of the target nodes is not online.

-ESRCH		Process does not exist.

-E2BIG		Too many pages to move.

-ENOMEM		Not enough memory to allocate control array.

-EFAULT		Parameters could not be accessed.

A test program for move_pages() may be found with the patches
on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3

From: Christoph Lameter <clameter@sgi.com>

  Detailed results for sys_move_pages()

  Pass a pointer to an integer to get_new_page() that may be used to
  indicate where the completion status of a migration operation should be
  placed.  This allows sys_move_pags() to report back exactly what happened to
  each page.

  Wish there would be a better way to do this. Looks a bit hacky.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Rafael J. Wysocki
d6277db4ab [PATCH] swsusp: rework memory shrinker
Rework the swsusp's memory shrinker in the following way:

- Simplify balance_pgdat() by removing all of the swsusp-related code
  from it.

- Make shrink_all_memory() use shrink_slab() and a new function
  shrink_all_zones() which calls shrink_active_list() and
  shrink_inactive_list() directly for each zone in a way that's optimized
  for suspend.

In shrink_all_memory() we try to free exactly as many pages as the caller
asks for, preferably in one shot, starting from easier targets.   If slab
caches are huge, they are most likely to have enough pages to reclaim.
 The inactive lists are next (the zones with more inactive pages go first)
etc.

Each time shrink_all_memory() attempts to shrink the active and inactive
lists for each zone in 5 passes.   In the first pass, only the inactive
lists are taken into consideration.   In the next two passes the active
lists are also shrunk, but mapped pages are not reclaimed.   In the last
two passes the active and inactive lists are shrunk and mapped pages are
reclaimed as well.  The aim of this is to alter the reclaim logic to choose
the best pages to keep on resume and improve the responsiveness of the
resumed system.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:48 -07:00
KAMEZAWA Hiroyuki
fadd8fbd15 [PATCH] support for panic at OOM
This patch adds panic_on_oom sysctl under sys.vm.

When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue
processes.  And if vm.panic_on_oom is 0 the kernel will do oom_kill() in
the same way as it does today.  Of course, the default value is 0 and only
root can modifies it.

In general, oom_killer works well and kill rogue processes.  So the whole
system can survive.  But there are environments where panic is preferable
rather than kill some processes.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:47 -07:00
David Howells
726c334223 [PATCH] VFS: Permit filesystem to perform statfs with a known root dentry
Give the statfs superblock operation a dentry pointer rather than a superblock
pointer.

This complements the get_sb() patch.  That reduced the significance of
sb->s_root, allowing NFS to place a fake root there.  However, NFS does
require a dentry to use as a target for the statfs operation.  This permits
the root in the vfsmount to be used instead.

linux/mount.h has been added where necessary to make allyesconfig build
successfully.

Interest has also been expressed for use with the FUSE and XFS filesystems.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:45 -07:00
David Howells
454e2398be [PATCH] VFS: Permit filesystem to override root dentry on mount
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.

The filesystem is then required to manually set the superblock and root dentry
pointers.  For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).

The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.

This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing.  In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.

The patch also makes the following changes:

 (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
     pointer argument and return an integer, so most filesystems have to change
     very little.

 (*) If one of the convenience function is not used, then get_sb() should
     normally call simple_set_mnt() to instantiate the vfsmount. This will
     always return 0, and so can be tail-called from get_sb().

 (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
     dcache upon superblock destruction rather than shrink_dcache_anon().

     This is required because the superblock may now have multiple trees that
     aren't actually bound to s_root, but that still need to be cleaned up. The
     currently called functions assume that the whole tree is rooted at s_root,
     and that anonymous dentries are not the roots of trees which results in
     dentries being left unculled.

     However, with the way NFS superblock sharing are currently set to be
     implemented, these assumptions are violated: the root of the filesystem is
     simply a dummy dentry and inode (the real inode for '/' may well be
     inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
     with child trees.

     [*] Anonymous until discovered from another tree.

 (*) The documentation has been adjusted, including the additional bit of
     changing ext2_* into foo_* in the documentation.

[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:45 -07:00
Linus Torvalds
45c091bb2d Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (139 commits)
  [POWERPC] re-enable OProfile for iSeries, using timer interrupt
  [POWERPC] support ibm,extended-*-frequency properties
  [POWERPC] Extra sanity check in EEH code
  [POWERPC] Dont look for class-code in pci children
  [POWERPC] Fix mdelay badness on shared processor partitions
  [POWERPC] disable floating point exceptions for init
  [POWERPC] Unify ppc syscall tables
  [POWERPC] mpic: add support for serial mode interrupts
  [POWERPC] pseries: Print PCI slot location code on failure
  [POWERPC] spufs: one more fix for 64k pages
  [POWERPC] spufs: fail spu_create with invalid flags
  [POWERPC] spufs: clear class2 interrupt status before wakeup
  [POWERPC] spufs: fix Makefile for "make clean"
  [POWERPC] spufs: remove stop_code from struct spu
  [POWERPC] spufs: fix spu irq affinity setting
  [POWERPC] spufs: further abstract priv1 register access
  [POWERPC] spufs: split the Cell BE support into generic and platform dependant parts
  [POWERPC] spufs: dont try to access SPE channel 1 count
  [POWERPC] spufs: use kzalloc in create_spu
  [POWERPC] spufs: fix initial state of wbox file
  ...

Manually resolved conflicts in:
	drivers/net/phy/Makefile
	include/asm-powerpc/spu.h
2006-06-22 22:11:30 -07:00
Ravikiran G Thirumalai
de047c1bcd [PATCH] avoid tasklist_lock at getrusage for multithreaded case too
Avoid taking tasklist_lock for at getrusage for the multithreaded case too.
We don't need to take the tasklist lock for thread traversal of a process
since Oleg's do-__unhash_process-under-siglock.patch and related work.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-22 15:05:57 -07:00
Andrew Morton
6cc0719181 [PATCH] suspend_console() warning fix
kernel/power/main.c: In function 'suspend_prepare':
kernel/power/main.c:89: warning: implicit declaration of function 'suspend_console'
kernel/power/main.c: In function 'suspend_finish':
kernel/power/main.c:137: warning: implicit declaration of function 'resume_console'

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-22 15:05:56 -07:00
Michael LeMay
d720024e94 [PATCH] selinux: add hooks for key subsystem
Introduce SELinux hooks to support the access key retention subsystem
within the kernel.  Incorporate new flask headers from a modified version
of the SELinux reference policy, with support for the new security class
representing retained keys.  Extend the "key_alloc" security hook with a
task parameter representing the intended ownership context for the key
being allocated.  Attach security information to root's default keyrings
within the SELinux initialization routine.

Has passed David's testsuite.

Signed-off-by: Michael LeMay <mdlemay@epoch.ncsc.mil>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-22 15:05:55 -07:00
Linus Torvalds
d9eaec9e29 Merge branch 'audit.b21' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b21' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: (25 commits)
  [PATCH] make set_loginuid obey audit_enabled
  [PATCH] log more info for directory entry change events
  [PATCH] fix AUDIT_FILTER_PREPEND handling
  [PATCH] validate rule fields' types
  [PATCH] audit: path-based rules
  [PATCH] Audit of POSIX Message Queue Syscalls v.2
  [PATCH] fix se_sen audit filter
  [PATCH] deprecate AUDIT_POSSBILE
  [PATCH] inline more audit helpers
  [PATCH] proc_loginuid_write() uses simple_strtoul() on non-terminated array
  [PATCH] update of IPC audit record cleanup
  [PATCH] minor audit updates
  [PATCH] fix audit_krule_to_{rule,data} return values
  [PATCH] add filtering by ppid
  [PATCH] log ppid
  [PATCH] collect sid of those who send signals to auditd
  [PATCH] execve argument logging
  [PATCH] fix deadlocks in AUDIT_LIST/AUDIT_LIST_RULES
  [PATCH] audit_panic() is audit-internal
  [PATCH] inotify (5/5): update kernel documentation
  ...

Manual fixup of conflict in unclude/linux/inotify.h
2006-06-20 15:37:56 -07:00
Linus Torvalds
2edc322d42 Merge git://git.infradead.org/~dwmw2/rbtree-2.6
* git://git.infradead.org/~dwmw2/rbtree-2.6:
  [RBTREE] Switch rb_colour() et al to en_US spelling of 'color' for consistency
  Update UML kernel/physmem.c to use rb_parent() accessor macro
  [RBTREE] Update hrtimers to use rb_parent() accessor macro.
  [RBTREE] Add explicit alignment to sizeof(long) for struct rb_node.
  [RBTREE] Merge colour and parent fields of struct rb_node.
  [RBTREE] Remove dead code in rb_erase()
  [RBTREE] Update JFFS2 to use rb_parent() accessor macro.
  [RBTREE] Update eventpoll.c to use rb_parent() accessor macro.
  [RBTREE] Update key.c to use rb_parent() accessor macro.
  [RBTREE] Update ext3 to use rb_parent() accessor macro.
  [RBTREE] Change rbtree off-tree marking in I/O schedulers.
  [RBTREE] Add accessor macros for colour and parent fields of rb_node
2006-06-20 14:51:22 -07:00
Linus Torvalds
be967b7e2f Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (199 commits)
  [MTD] NAND: Fix breakage all over the place
  [PATCH] NAND: fix remaining OOB length calculation
  [MTD] NAND Fixup NDFC merge brokeness
  [MTD NAND] S3C2410 driver cleanup
  [MTD NAND] s3c24x0 board: Fix clock handling, ensure proper initialisation.
  [JFFS2] Check CRC32 on dirent and data nodes each time they're read
  [JFFS2] When retiring nextblock, allocate a node_ref for the wasted space
  [JFFS2] Mark XATTR support as experimental, for now
  [JFFS2] Don't trust node headers before the CRC is checked.
  [MTD] Restore MTD_ROM and MTD_RAM types
  [MTD] assume mtd->writesize is 1 for NOR flashes
  [MTD NAND] Fix s3c2410 NAND driver so it at least _looks_ like it compiles
  [MTD] Prepare physmap for 64-bit-resources
  [JFFS2] Fix more breakage caused by janitorial meddling.
  [JFFS2] Remove stray __exit from jffs2_compressors_exit()
  [MTD] Allow alternate JFFS2 mount variant for root filesystem.
  [MTD] Disconnect struct mtd_info from ABI
  [MTD] replace MTD_RAM with MTD_GENERIC_TYPE
  [MTD] replace MTD_ROM with MTD_GENERIC_TYPE
  [MTD] remove a forgotten MTD_XIP
  ...
2006-06-20 14:50:31 -07:00
Steve Grubb
41757106b9 [PATCH] make set_loginuid obey audit_enabled
Hi,

I was doing some testing and noticed that when the audit system was disabled,
I was still getting messages about the loginuid being set. The following patch
makes audit_set_loginuid look at in_syscall to determine if it should create
an audit event. The loginuid will continue to be set as long as there is a context.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:29 -04:00
Amy Griffis
9c937dcc71 [PATCH] log more info for directory entry change events
When an audit event involves changes to a directory entry, include
a PATH record for the directory itself.  A few other notable changes:

    - fixed audit_inode_child() hooks in fsnotify_move()
    - removed unused flags arg from audit_inode()
    - added audit log routines for logging a portion of a string

Here's some sample output.

before patch:
type=SYSCALL msg=audit(1149821605.320:26): arch=40000003 syscall=39 success=yes exit=0 a0=bf8d3c7c a1=1ff a2=804e1b8 a3=bf8d3c7c items=1 ppid=739 pid=800 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 comm="mkdir" exe="/bin/mkdir" subj=root:system_r:unconfined_t:s0-s0:c0.c255
type=CWD msg=audit(1149821605.320:26):  cwd="/root"
type=PATH msg=audit(1149821605.320:26): item=0 name="foo" parent=164068 inode=164010 dev=03:00 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_t:s0

after patch:
type=SYSCALL msg=audit(1149822032.332:24): arch=40000003 syscall=39 success=yes exit=0 a0=bfdd9c7c a1=1ff a2=804e1b8 a3=bfdd9c7c items=2 ppid=714 pid=777 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 comm="mkdir" exe="/bin/mkdir" subj=root:system_r:unconfined_t:s0-s0:c0.c255
type=CWD msg=audit(1149822032.332:24):  cwd="/root"
type=PATH msg=audit(1149822032.332:24): item=0 name="/root" inode=164068 dev=03:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_dir_t:s0
type=PATH msg=audit(1149822032.332:24): item=1 name="foo" inode=164010 dev=03:00 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_t:s0

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:28 -04:00
Amy Griffis
6a2bceec0e [PATCH] fix AUDIT_FILTER_PREPEND handling
Clear AUDIT_FILTER_PREPEND flag after adding rule to list.  This
fixes three problems when a rule is added with the -A syntax:

    - auditctl displays filter list as "(null)"
    - the rule cannot be removed using -d
    - a duplicate rule can be added with -a

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:28 -04:00
Al Viro
0a73dccc4f [PATCH] validate rule fields' types
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:27 -04:00
Amy Griffis
f368c07d72 [PATCH] audit: path-based rules
In this implementation, audit registers inotify watches on the parent
directories of paths specified in audit rules.  When audit's inotify
event handler is called, it updates any affected rules based on the
filesystem event.  If the parent directory is renamed, removed, or its
filesystem is unmounted, audit removes all rules referencing that
inotify watch.

To keep things simple, this implementation limits location-based
auditing to the directory entries in an existing directory.  Given
a path-based rule for /foo/bar/passwd, the following table applies:

    passwd modified -- audit event logged
    passwd replaced -- audit event logged, rules list updated
    bar renamed     -- rule removed
    foo renamed     -- untracked, meaning that the rule now applies to
		       the new location

Audit users typically want to have many rules referencing filesystem
objects, which can significantly impact filtering performance.  This
patch also adds an inode-number-based rule hash to mitigate this
situation.

The patch is relative to the audit git tree:
http://kernel.org/git/?p=linux/kernel/git/viro/audit-current.git;a=summary
and uses the inotify kernel API:
http://lkml.org/lkml/2006/6/1/145

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:27 -04:00
George C. Wilson
20ca73bc79 [PATCH] Audit of POSIX Message Queue Syscalls v.2
This patch adds audit support to POSIX message queues.  It applies cleanly to
the lspp.b15 branch of Al Viro's git tree.  There are new auxiliary data
structures, and collection and emission routines in kernel/auditsc.c.  New hooks
in ipc/mqueue.c collect arguments from the syscalls.

I tested the patch by building the examples from the POSIX MQ library tarball.
Build them -lrt, not against the old MQ library in the tarball.  Here's the URL:
http://www.geocities.com/wronski12/posix_ipc/libmqueue-4.41.tar.gz
Do auditctl -a exit,always -S for mq_open, mq_timedsend, mq_timedreceive,
mq_notify, mq_getsetattr.  mq_unlink has no new hooks.  Please see the
corresponding userspace patch to get correct output from auditd for the new
record types.

[fixes folded]

Signed-off-by: George Wilson <ltcgcw@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:26 -04:00
Al Viro
014149cce1 [PATCH] deprecate AUDIT_POSSBILE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:25 -04:00
Al Viro
d8945bb51a [PATCH] inline more audit helpers
pull checks for ->audit_context into inlined wrappers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:25 -04:00
Linda Knippers
ac03221a4f [PATCH] update of IPC audit record cleanup
The following patch addresses most of the issues with the IPC_SET_PERM
records as described in:
https://www.redhat.com/archives/linux-audit/2006-May/msg00010.html
and addresses the comments I received on the record field names.

To summarize, I made the following changes:

1. Changed sys_msgctl() and semctl_down() so that an IPC_SET_PERM
   record is emitted in the failure case as well as the success case.
   This matches the behavior in sys_shmctl().  I could simplify the
   code in sys_msgctl() and semctl_down() slightly but it would mean
   that in some error cases we could get an IPC_SET_PERM record
   without an IPC record and that seemed odd.

2. No change to the IPC record type, given no feedback on the backward
   compatibility question.

3. Removed the qbytes field from the IPC record.  It wasn't being
   set and when audit_ipc_obj() is called from ipcperms(), the
   information isn't available.  If we want the information in the IPC
   record, more extensive changes will be necessary.  Since it only
   applies to message queues and it isn't really permission related, it
   doesn't seem worth it.

4. Removed the obj field from the IPC_SET_PERM record.  This means that
   the kern_ipc_perm argument is no longer needed.

5. Removed the spaces and renamed the IPC_SET_PERM field names.  Replaced iuid and
   igid fields with ouid and ogid in the IPC record.

I tested this with the lspp.22 kernel on an x86_64 box.  I believe it
applies cleanly on the latest kernel.

-- ljk

Signed-off-by: Linda Knippers <linda.knippers@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:24 -04:00
Serge E. Hallyn
5d136a010d [PATCH] minor audit updates
Just a few minor proposed updates.  Only the last one will
actually affect behavior.  The rest are just misleading
code.

Several AUDIT_SET functions return 'old' value, but only
return value <0 is checked for.  So just return 0.

propagate audit_set_rate_limit and audit_set_backlog_limit
error values

In audit_buffer_free, the audit_freelist_count was being
incremented even when we discard the return buffer, so
audit_freelist_count can end up wrong.  This could cause
the actual freelist to shrink over time, eventually
threatening to degrate audit performance.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:23 -04:00
Amy Griffis
0a3b483e83 [PATCH] fix audit_krule_to_{rule,data} return values
Don't return -ENOMEM when callers of these functions are checking for
a NULL return.  Bug noticed by Serge Hallyn.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:23 -04:00
Al Viro
3c66251e57 [PATCH] add filtering by ppid
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:22 -04:00
Al Viro
f46038ff7d [PATCH] log ppid
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:22 -04:00
Al Viro
e1396065e0 [PATCH] collect sid of those who send signals to auditd
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:21 -04:00
Al Viro
473ae30bc7 [PATCH] execve argument logging
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:21 -04:00
Al Viro
9044e6bca5 [PATCH] fix deadlocks in AUDIT_LIST/AUDIT_LIST_RULES
We should not send a pile of replies while holding audit_netlink_mutex
since we hold the same mutex when we receive commands.  As the result,
we can get blocked while sending and sit there holding the mutex while
auditctl is unable to send the next command and get around to receiving
what we'd sent.

Solution: create skb and put them into a queue instead of sending;
once we are done, send what we've got on the list.  The former can
be done synchronously while we are handling AUDIT_LIST or AUDIT_LIST_RULES;
we are holding audit_netlink_mutex at that point.  The latter is done
asynchronously and without messing with audit_netlink_mutex.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:20 -04:00
Amy Griffis
2d9048e201 [PATCH] inotify (1/5): split kernel API from userspace support
The following series of patches introduces a kernel API for inotify,
making it possible for kernel modules to benefit from inotify's
mechanism for watching inodes.  With these patches, inotify will
maintain for each caller a list of watches (via an embedded struct
inotify_watch), where each inotify_watch is associated with a
corresponding struct inode.  The caller registers an event handler and
specifies for which filesystem events their event handler should be
called per inotify_watch.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Acked-by: Robert Love <rml@novell.com>
Acked-by: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:17 -04:00
Linus Torvalds
557240b48e Add support for suspending and resuming the whole console subsystem
Trying to suspend/resume with console messages flying all around is
doomed to failure, when the devices that the messages are trying to
go to are being shut down.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-19 18:16:01 -07:00
Oleg Nesterov
f53ae1dc34 [PATCH] arm_timer: remove a racy and obsolete PF_EXITING check
arm_timer() checks PF_EXITING to prevent BUG_ON(->exit_state)
in run_posix_cpu_timers().

However, for some reason it does so only for CPUCLOCK_PERTHREAD
case (which is imho wrong).

Also, this check is not reliable, PF_EXITING could be set on
another cpu without any locks/barriers just after the check,
so it can't prevent from attaching the timer to the exiting
task.

The previous patch makes this check unneeded.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17 10:52:13 -07:00
Oleg Nesterov
30f1e3dd8c [PATCH] run_posix_cpu_timers: remove a bogus BUG_ON()
do_exit() clears ->it_##clock##_expires, but nothing prevents
another cpu to attach the timer to exiting process after that.
arm_timer() tries to protect against this race, but the check
is racy.

After exit_notify() does 'write_unlock_irq(&tasklist_lock)' and
before do_exit() calls 'schedule() local timer interrupt can find
tsk->exit_state != 0. If that state was EXIT_DEAD (or another cpu
does sys_wait4) interrupted task has ->signal == NULL.

At this moment exiting task has no pending cpu timers, they were
cleanuped in __exit_signal()->posix_cpu_timers_exit{,_group}(),
so we can just return from irq.

John Stultz recently confirmed this bug, see

	http://marc.theaimsgroup.com/?l=linux-kernel&m=115015841413687

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17 10:52:13 -07:00
Oleg Nesterov
8f17fc20bf [PATCH] check_process_timers: fix possible lockup
If the local timer interrupt happens just after do_exit() sets PF_EXITING
(and before it clears ->it_xxx_expires) run_posix_cpu_timers() will call
check_process_timers() with tasklist_lock + ->siglock held and

	check_process_timers:

		t = tsk;
		do {
			....

			do {
				t = next_thread(t);
			} while (unlikely(t->flags & PF_EXITING));
		} while (t != tsk);

the outer loop will never stop.

Actually, the window is bigger.  Another process can attach the timer
after ->it_xxx_expires was cleared (see the next commit) and the 'if
(PF_EXITING)' check in arm_timer() is racy (see the one after that).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17 10:52:13 -07:00
Sam Ravnborg
b817f6feff kbuild: check license compatibility when building modules
Modules that uses GPL symbols can no longer be build with kbuild,
the build will fail during the modpost step.
When a GPL-incompatible module uses a EXPORT_SYMBOL_GPL_FUTURE symbol
then warn during modpost so author are actually notified.

The actual license compatibility check is shared with the kernel
to make sure it is in sync.

Patch originally from: Andreas Gruenbacher <agruen@suse.de> and
Ram Pai <linuxram@us.ibm.com>

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2006-06-09 21:53:55 +02:00
Anton Blanchard
651d765d0b [PATCH] Add a prctl to change the endianness of a process.
This new prctl is intended for changing the execution mode of the
processor, on processors that support both a little-endian mode and a
big-endian mode.  It is intended for use by programs such as
instruction set emulators (for example an x86 emulator on PowerPC),
which may find it convenient to use the processor in an alternate
endianness mode when executing translated instructions.

Note that this does not imply the existence of a fully-fledged ABI for
both endiannesses, or of compatibility code for converting system
calls done in the non-native endianness mode.  The program is expected
to arrange for all of its system call arguments to be presented in the
native endianness.

Switching between big and little-endian mode will require some care in
constructing the instruction sequence for the switch.  Generally the
instructions up to the instruction that invokes the prctl system call
will have to be in the old endianness, and subsequent instructions
will have to be in the new endianness.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-06-09 21:24:13 +10:00
Stephen Hemminger
8d16b76421 [PATCH] hrtimer: export symbols
From: Stephen Hemminger <shemminger@osdl.org>

I want to use the hrtimer's in the netem (Network Emulator) qdisc.  But the
necessary symbols aren't exported for module use.

Also needed by SystemTap.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Stone, Joshua I" <joshua.i.stone@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-31 16:27:11 -07:00
Linus Torvalds
f1adad78dd Revert "[PATCH] sched: fix interactive task starvation"
This reverts commit 5ce74abe78 (and its
dependent commit 8a5bc075b8), because of
audio underruns.

Reported by Rene Herman <rene.herman@keyaccess.nl>, who also pinpointed
the exact cause of the underruns:

  "Audio underruns galore, with only ogg123 and firefox (browsing the
   GIT tree online is also a nice trigger by the way).

   If I back it out, everything is fine for me again."

Cc: Rene Herman <rene.herman@keyaccess.nl>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 18:54:09 -07:00
Zachary Amsden
0662b71322 [PATCH] Fix a NO_IDLE_HZ timer bug
Under certain timing conditions, a race during boot occurs where timer
ticks are being processed on remote CPUs.  The remote timer ticks can
increment jiffies, and if this happens during a window when a timeout is
very close to expiring but a local tick has not yet been delivered, you can
end up with

1) No softirq pending
2) A local timer wheel which is not synced to jiffies
3) No high resolution timer active
4) A local timer which is supposed to fire before the current jiffies value.

In this circumstance, the comparison in next_timer_interrupt overflows,
because the base of the comparison for high resolution timers is jiffies,
but for the softirq timer wheel, it is relative the the current base of the
wheel (jiffies_base).

Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:21 -07:00
Paul Jackson
92d1dbd274 [PATCH] cpuset: might_sleep_if check in cpuset_zones_allowed
It's too easy to incorrectly call cpuset_zone_allowed() in an atomic
context without __GFP_HARDWALL set, and when done, it is not noticed until
a tight memory situation forces allocations to be tried outside the current
cpuset.

Add a 'might_sleep_if()' check, to catch this earlier on, instead of
waiting for a similar check in the mutex_lock() code, which is only rarely
invoked.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:18 -07:00
Paul Jackson
36be57ffe3 [PATCH] cpuset: update cpuset_zones_allowed comment
Update the kernel/cpuset.c:cpuset_zone_allowed() comment.

The rule for when mm/page_alloc.c should call cpuset_zone_allowed()
was intended to be:

  Don't call cpuset_zone_allowed() if you can't sleep, unless you
  pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
  the code that might scan up ancestor cpusets and sleep.

The explanation of this rule in the comment above cpuset_zone_allowed() was
stale, as a result of a restructuring of some __alloc_pages() code in
November 2005.

Rewrite that comment ...

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:18 -07:00
David Woodhouse
18594822fc Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-05-16 01:19:52 +01:00
Trent Piepho
5e37661389 [PATCH] symbol_put_addr() locks kernel
Even since a previous patch:

Fix race between CONFIG_DEBUG_SLABALLOC and modules
Sun, 27 Jun 2004 17:55:19 +0000 (17:55 +0000)
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=92b3db26d31cf21b70e3c1eadc56c179506d8fbe

The function symbol_put_addr() will deadlock the kernel.

symbol_put_addr() would acquire modlist_lock, then while holding the lock call
two functions kernel_text_address() and module_text_address() which also try
to acquire the same lock.  This deadlocks the kernel of course.

This patch changes symbol_put_addr() to not acquire the modlist_lock, it
doesn't need it since it never looks at the module list directly.  Also, it
now uses core_kernel_text() instead of kernel_text_address().  The latter has
an additional check for addr inside a module, but we don't need to do that
since we call module_text_address() (the same function kernel_text_address
uses) ourselves.

Signed-off-by: Trent Piepho <xyzzy@speakeasy.org>
Cc: Zwane Mwaikambo <zwane@fsmlabs.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Johannes Stezenbach <js@linuxtv.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-15 11:20:55 -07:00
Heiko Carstens
986733e01d [PATCH] RCU: introduce rcu_needs_cpu() interface
With "Paul E. McKenney" <paulmck@us.ibm.com>

Introduce rcu_needs_cpu() interface.  This can be used to tell if there
will be a new rcu batch on a cpu soon by looking at the curlist pointer.
This can be used to avoid to enter a tickless idle state where the cpu
would miss that a new batch is ready when rcu_start_batch would be called
on a different cpu.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-15 11:20:55 -07:00
Linus Torvalds
f358166a94 ptrace_attach: fix possible deadlock schenario with irqs
Eric Biederman points out that we can't take the task_lock while holding
tasklist_lock for writing, because another CPU that holds the task lock
might take an interrupt that then tries to take tasklist_lock for writing.

Which would be a nasty deadlock, with one CPU spinning forever in an
interrupt handler (although admittedly you need to really work at
triggering it ;)

Since the ptrace_attach() code is special and very unusual, just make it
be extra careful, and use trylock+repeat to avoid the possible deadlock.

Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-11 11:08:49 -07:00
David Woodhouse
6f18a022fb Finally remove the obnoxious inter_module_xxx()
This was already a bad plan when I argued against adding it in the first
place. Good riddance.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-05-08 22:40:05 +01:00
Linus Torvalds
f5b40e363a Fix ptrace_attach()/ptrace_traceme()/de_thread() race
This holds the task lock (and, for ptrace_attach, the tasklist_lock)
over the actual attach event, which closes a race between attacking to a
thread that is either doing a PTRACE_TRACEME or getting de-threaded.

Thanks to Oleg Nesterov for reminding me about this, and Chris Wright
for noticing a lost return value in my first version.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-07 10:49:33 -07:00
Steve Grubb
2ad312d209 [PATCH] Audit Filter Performance
While testing the watch performance, I noticed that selinux_task_ctxid()
was creeping into the results more than it should. Investigation showed
that the function call was being called whether it was needed or not. The
below patch fixes this.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:10:07 -04:00
Steve Grubb
073115d6b2 [PATCH] Rework of IPC auditing
1) The audit_ipc_perms() function has been split into two different
functions:
        - audit_ipc_obj()
        - audit_ipc_set_perm()

There's a key shift here...  The audit_ipc_obj() collects the uid, gid,
mode, and SElinux context label of the current ipc object.  This
audit_ipc_obj() hook is now found in several places.  Most notably, it
is hooked in ipcperms(), which is called in various places around the
ipc code permforming a MAC check.  Additionally there are several places
where *checkid() is used to validate that an operation is being
performed on a valid object while not necessarily having a nearby
ipcperms() call.  In these locations, audit_ipc_obj() is called to
ensure that the information is captured by the audit system.

The audit_set_new_perm() function is called any time the permissions on
the ipc object changes.  In this case, the NEW permissions are recorded
(and note that an audit_ipc_obj() call exists just a few lines before
each instance).

2) Support for an AUDIT_IPC_SET_PERM audit message type.  This allows
for separate auxiliary audit records for normal operations on an IPC
object and permissions changes.  Note that the same struct
audit_aux_data_ipcctl is used and populated, however there are separate
audit_log_format statements based on the type of the message.  Finally,
the AUDIT_IPC block of code in audit_free_aux() was extended to handle
aux messages of this new type.  No more mem leaks I hope ;-)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:10:04 -04:00
Steve Grubb
ce29b682e2 [PATCH] More user space subject labels
Hi,

The patch below builds upon the patch sent earlier and adds subject label to
all audit events generated via the netlink interface. It also cleans up a few
other minor things.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:10:01 -04:00
Steve Grubb
e7c3497013 [PATCH] Reworked patch for labels on user space messages
The below patch should be applied after the inode and ipc sid patches.
This patch is a reworking of Tim's patch that has been updated to match
the inode and ipc patches since its similar.

[updated:
>  Stephen Smalley also wanted to change a variable from isec to tsec in the
>  user sid patch.                                                              ]

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:58 -04:00
Steve Grubb
9c7aa6aa74 [PATCH] change lspp ipc auditing
Hi,

The patch below converts IPC auditing to collect sid's and convert to context
string only if it needs to output an audit record. This patch depends on the
inode audit change patch already being applied.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:56 -04:00
Steve Grubb
1b50eed9ca [PATCH] audit inode patch
Previously, we were gathering the context instead of the sid. Now in this patch,
we gather just the sid and convert to context only if an audit event is being
output.

This patch brings the performance hit from 146% down to 23%

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:53 -04:00
Darrel Goeddel
3dc7e3153e [PATCH] support for context based audit filtering, part 2
This patch provides the ability to filter audit messages based on the
elements of the process' SELinux context (user, role, type, mls sensitivity,
and mls clearance).  It uses the new interfaces from selinux to opaquely
store information related to the selinux context and to filter based on that
information.  It also uses the callback mechanism provided by selinux to
refresh the information when a new policy is loaded.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:36 -04:00
Al Viro
97e94c4530 [PATCH] no need to wank with task_lock() and pinning task down in audit_syscall_exit()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:21 -04:00
Al Viro
5411be59db [PATCH] drop task argument of audit_syscall_{entry,exit}
... it's always current, and that's a good thing - allows simpler locking.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:18 -04:00
Al Viro
e495149b17 [PATCH] drop gfp_mask in audit_log_exit()
now we can do that - all callers are process-synchronous and do not hold
any locks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:16 -04:00
Al Viro
fa84cb935d [PATCH] move call of audit_free() into do_exit()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:13 -04:00
Al Viro
45d9bb0e37 [PATCH] deal with deadlocks in audit_free()
Don't assume that audit_log_exit() et.al. are called for the context of
current; pass task explictly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:07 -04:00
Andrew Morton
13e87ec686 [PATCH] request_irq(): remove warnings from irq probing
- Add new SA_PROBEIRQ which suppresses the new sharing-mismatch warning.
  Some drivers like to use request_irq() to find an unused interrupt slot.

- Use it in i82365.c

- Kill unused SA_PROBE.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-28 08:33:46 -07:00
dean gaudet
47bb789973 [PATCH] off-by-1 in kernel/power/main.c
There's an off-by-1 in kernel/power/main.c:state_store() ...  if your
kernel just happens to have some non-zero data at pm_states[PM_SUSPEND_MAX]
(i.e.  one past the end of the array) then it'll let you write anything you
want to /sys/power/state and in response the box will enter S5.

Signed-off-by: dean gaudet <dean@arctic.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-28 08:33:46 -07:00
Chandra Seetharaman
83d722f7e1 [PATCH] Remove __devinit and __cpuinit from notifier_call definitions
Few of the notifier_chain_register() callers use __init in the definition
of notifier_call.  It is incorrect as the function definition should be
available after the initializations (they do not unregister them during
initializations).

This patch fixes all such usages to _not_ have the notifier_call __init
section.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26 08:30:03 -07:00
Chandra Seetharaman
649bbaa484 [PATCH] Remove __devinitdata from notifier block definitions
Few of the notifier_chain_register() callers use __devinitdata in the
definition of notifier_block data structure.  It is incorrect as the
data structure should be available after the initializations (they do
not unregister them during initializations).

This was leading to an oops when notifier_chain_register() call is
invoked for those callback chains after initialization.

This patch fixes all such usages to _not_ have the notifier_block data
structure in the init data section.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26 08:27:50 -07:00
David Woodhouse
ed198cb497 [RBTREE] Update hrtimers to use rb_parent() accessor macro.
Also switch it to use the same method of using off-tree nodes as
everyone else now does -- set them to point to themselves.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-04-22 02:38:50 +01:00
Linus Torvalds
402a26f0c0 Merge branch 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block
* 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block:
  [PATCH] block/elevator.c: remove unused exports
  [PATCH] splice: fix smaller sized splice reads
  [PATCH] Don't inherit ->splice_pipe across forks
  [patch] cleanup: use blk_queue_stopped
  [PATCH] Document online io scheduler switching
2006-04-20 08:17:04 -07:00
Ananth N Mavinakayanahalli
7522a8423b [PATCH] kprobes: NULL out non-relevant fields in struct kretprobe
In cases where a struct kretprobe's *_handler fields are non-NULL, it is
possible to cause a system crash, due to the possibility of calls ending up
in zombie functions.  Documentation clearly states that unused *_handlers
should be set to NULL, but kprobe users sometimes fail to do so.

Fix it by setting the non-relevant fields of the struct kretprobe to NULL.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-20 07:54:03 -07:00
Jens Axboe
a0aa7f68af [PATCH] Don't inherit ->splice_pipe across forks
It's really task private, so clear that field on fork after copying
task structure.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-20 13:05:33 +02:00
OGAWA Hirofumi
5a7b46b369 [PATCH] Add more prevent_tail_call()
Those also break userland regs like following.

   00000000 <sys_chown16>:
      0:	0f b7 44 24 0c       	movzwl 0xc(%esp),%eax
      5:	83 ca ff             	or     $0xffffffff,%edx
      8:	0f b7 4c 24 08       	movzwl 0x8(%esp),%ecx
      d:	66 83 f8 ff          	cmp    $0xffffffff,%ax
     11:	0f 44 c2             	cmove  %edx,%eax
     14:	66 83 f9 ff          	cmp    $0xffffffff,%cx
     18:	0f 45 d1             	cmovne %ecx,%edx
     1b:	89 44 24 0c          	mov    %eax,0xc(%esp)
     1f:	89 54 24 08          	mov    %edx,0x8(%esp)
     23:	e9 fc ff ff ff       	jmp    24 <sys_chown16+0x24>

where the tailcall at the end overwrites the incoming stack-frame.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
[ I would _really_ like to have a way to tell gcc about calling
  conventions. The "prevent_tail_call()" macro is pretty ugly ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19 16:27:18 -07:00
Rafael J. Wysocki
4a3b98a422 [PATCH] swsusp: prevent possible image corruption on resume
The function free_pagedir() used by swsusp for freeing its internal data
structures clears the PG_nosave and PG_nosave_free flags for each page
being freed.

However, during resume PG_nosave_free set means that the page in
question is "unsafe" (ie.  it will be overwritten in the process of
restoring the saved system state from the image), so it should not be
used for the image data.

Therefore free_pagedir() should not clear PG_nosave_free if it's called
during resume (otherwise "unsafe" pages freed by it may be used for
storing the image data and the data may get corrupted later on).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19 09:13:49 -07:00
Eric W. Biederman
5e85d4abe3 [PATCH] task: Make task list manipulations RCU safe
While we can currently walk through thread groups, process groups, and
sessions with just the rcu_read_lock, this opens the door to walking the
entire task list.

We already have all of the other RCU guarantees so there is no cost in
doing this, this should be enough so that proc can stop taking the
tasklist lock during readdir.

prev_task was killed because it has no users, and using it will miss new
tasks when doing an rcu traversal.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19 09:13:49 -07:00
Eric W. Biederman
64541d1970 [PATCH] kill unushed __put_task_struct_cb
Somehow in the midst of dotting i's and crossing t's during
the merge up to rc1 we wound up keeping __put_task_struct_cb
when it should have been killed as it no longer has any users.
Sorry I probably should have caught this while it was
still in the -mm tree.

Having the old code there gets confusing when reading
through the code and trying to understand what is
happening.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-14 17:43:57 -07:00
Adrian Bunk
78a596b449 [PATCH] remove kernel/power/pm.c:pm_unregister()
Since the last user is removed in -mm, we can now remove this long deprecated
function.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-04-14 12:25:26 -07:00
Roland McGrath
e57a505984 [PATCH] fix non-leader exec under ptrace
This reverts most of commit 30e0fca6c1.
It broke the case of non-leader MT exec when ptraced.
I think the bug it was intended to fix was already addressed by commit
788e05a67c.

Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-14 08:59:13 -07:00
Oleg Nesterov
a145410dcc [PATCH] __group_complete_signal: remove bogus BUG_ON
Commit e56d090310

   [PATCH] RCU signal handling

made this BUG_ON() unsafe. This code runs under ->siglock,
while switch_exec_pids() takes tasklist_lock.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 07:34:01 -07:00
Linus Torvalds
88dd9c16ce Merge branch 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block
* 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block:
  [PATCH] vfs: add splice_write and splice_read to documentation
  [PATCH] Remove sys_ prefix of new syscalls from __NR_sys_*
  [PATCH] splice: warning fix
  [PATCH] another round of fs/pipe.c cleanups
  [PATCH] splice: comment styles
  [PATCH] splice: add Ingo as addition copyright holder
  [PATCH] splice: unlikely() optimizations
  [PATCH] splice: speedups and optimizations
  [PATCH] pipe.c/fifo.c code cleanups
  [PATCH] get rid of the PIPE_*() macros
  [PATCH] splice: speedup __generic_file_splice_read
  [PATCH] splice: add direct fd <-> fd splicing support
  [PATCH] splice: add optional input and output offsets
  [PATCH] introduce a "kernel-internal pipe object" abstraction
  [PATCH] splice: be smarter about calling do_page_cache_readahead()
  [PATCH] splice: optimize the splice buffer mapping
  [PATCH] splice: cleanup __generic_file_splice_read()
  [PATCH] splice: only call wake_up_interruptible() when we really have to
  [PATCH] splice: potential !page dereference
  [PATCH] splice: mark the io page as accessed
2006-04-11 06:34:02 -07:00
Joe Korty
5ef37b1964 [PATCH] add cpu_relax to hrtimer_cancel
Add a cpu_relax() to the hand-coded spinwait in hrtimer_cancel().

Signed-off-by: Joe Korty <joe.korty@ccur.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:42 -07:00
Christoph Hellwig
d824e66a9a [PATCH] build kernel/irq/migration.c only if CONFIG_GENERIC_PENDING_IRQ is set
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:41 -07:00
Adrian Bunk
aa7271076a [PATCH] the scheduled unexport of panic_timeout
Implement the scheduled unexport of panic_timeout.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:40 -07:00
Andrew Morton
ba6edfcd17 [PATCH] timer initialisation fix
We need the boot CPU's tvec_bases[] entry to be initialised super-early in
boot, for early_serial_setup().  That runs within setup_arch(), before even
per-cpu areas are initialised.

The patch changes tvec_bases to use compile-time initialisation, and adds a
separate array `tvec_base_done' to keep track of which CPU has had its
tvec_bases[] entry initialised (because we can no longer use the zeroness of
that tvec_bases[] entry to determine whether it has been initialised).

Thanks to Eugene Surovegin <ebs@ebshome.net> for diagnosing this.

Cc: Eugene Surovegin <ebs@ebshome.net>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:40 -07:00
Hyok S. Choi
3016b42153 [PATCH] frv: define MMU mode specific syscalls as 'cond_syscall' and clean up unneeded macros
For some architectures, a few syscalls are not linked in noMMU mode.  In
that case, the MMU depending syscalls are needed to be defined as
'cond_syscall'.  For example, ARM architecture selectively links sys_mlock
by the mode configuration.

In case of FRV, it has been managed by #ifdef CONFIG_MMU macro in
arch/frv/kernel/entry.S.  However these conditional macros are just
duplicates if they were defined as cond_syscall.  Compilation test is done
with FRV toolchains for both of MMU and noMMU mode.

Signed-off-by: Hyok S. Choi <hyok.choi@samsung.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:33 -07:00
Mike Galbraith
8a5bc075b8 [PATCH] sched: don't awaken RT tasks on expired array
RT tasks are being awakened on the expired array when expired_starving() is
true, whereas they really should be excluded.  Fix.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:30 -07:00
Mike Galbraith
5ce74abe78 [PATCH] sched: fix interactive task starvation
Fix a starvation problem that occurs when a stream of highly interactive tasks
delay an array switch for extended periods despite EXPIRED_STARVING(rq) being
true.  AFAIKT, the only choice is to enqueue awakening tasks on the expired
array in this case.

Without this patch, it can be nearly impossible to remotely login to a busy
server, and interactive shell commands can starve for minutes.

Also, convert the EXPIRED_STARVING macro into an inline function which humans
can understand.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:30 -07:00
Jens Axboe
b92ce55893 [PATCH] splice: add direct fd <-> fd splicing support
It's more efficient for sendfile() emulation. Basically we cache an
internal private pipe and just use that as the intermediate area for
pages. Direct splicing is not available from sys_splice(), it is only
meant to be used for sendfile() emulation.

Additional patch from Ingo Molnar to avoid the PIPE_BUFFERS loop at
exit for the normal fast path.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-11 13:52:07 +02:00
Jordan Hargrave
b20367a6c2 [PATCH] x86_64: Fix drift with HPET timer enabled
If the HPET timer is enabled, the clock can drift by ~3 seconds a day.
This is due to the HPET timer not being initialized with the correct
setting (still using PIT count).

If HZ changes, this drift can become even more pronounced.

HPET patch initializes tick_nsec with correct tick_nsec settings for
HPET timer.

Vojtech comments:

  "It's not entirely correct (it assumes the HPET ticks totally
   exactly), but it's significantly better than assuming the PIT error
   there."

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09 11:53:53 -07:00
Eric Sesterhenn
9f31252cb6 BUG_ON() Conversion in kernel/signal.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02 13:45:55 +02:00
Eric Sesterhenn
fda8bd78a1 BUG_ON() Conversion in kernel/signal.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02 13:44:47 +02:00
Eric Sesterhenn
524223ca81 BUG_ON() Conversion in kernel/ptrace.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02 13:43:40 +02:00
Kalin KOZHUHAROV
8ba8e95ed1 Fix comments: s/granuality/granularity/
I was grepping through the code and some `grep ganularity -R .` didn't
catch what I thought. Then looking closer I saw the term "granuality"
used in only four places (in comments) and granularity in many more
places describing the same idea. Some other facts:

dictionary.com does not know such a word
define:granuality on google is not found (and pages for granuality are
mostly related to patches to the kernel)
it has not been discussed as a term on LKML, AFAICS (=Can Search)

To be consistent, I think granularity should be used everywhere.

Signed-off-by: Kalin KOZHUHAROV <kalin@thinrope.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01 01:41:22 +02:00
Eric Sesterhenn
8abd8e298e BUG_ON() Conversion in kernel/printk.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01 01:21:17 +02:00
Adrian Bunk
3e6e952d1d help text: SOFTWARE_SUSPEND doesn't need ACPI
The note that SOFTWARE_SUSPEND doesn't need APM is helpful, but nowadays
the information that it doesn't need ACPI, too, is even more helpful.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01 01:03:08 +02:00
Kirill Korotaev
4286229868 [PATCH] wrong error path in dup_fd() leading to oopses in RCU
Wrong error path in dup_fd() - it should return NULL on error,
not an address of already freed memory :/

Triggered by OpenVZ stress test suite.

What is interesting is that it was causing different oopses in RCU like
below:
Call Trace:
   [<c013492c>] rcu_do_batch+0x2c/0x80
   [<c0134bdd>] rcu_process_callbacks+0x3d/0x70
   [<c0126cf3>] tasklet_action+0x73/0xe0
   [<c01269aa>] __do_softirq+0x10a/0x130
   [<c01058ff>] do_softirq+0x4f/0x60
   =======================
   [<c0113817>] smp_apic_timer_interrupt+0x77/0x110
   [<c0103b54>] apic_timer_interrupt+0x1c/0x24
  Code:  Bad EIP value.
   <0>Kernel panic - not syncing: Fatal exception in interrupt

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Dmitry Mishin <dim@openvz.org>
Signed-Off-By: Kirill Korotaev <dev@openvz.org>
Signed-Off-By: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:25:46 -08:00
Eric W. Biederman
92476d7fc0 [PATCH] pidhash: Refactor the pid hash table
Simplifies the code, reduces the need for 4 pid hash tables, and makes the
code more capable.

In the discussions I had with Oleg it was felt that to a large extent the
cleanup itself justified the work.  With struct pid being dynamically
allocated meant we could create the hash table entry when the pid was
allocated and free the hash table entry when the pid was freed.  Instead of
playing with the hash lists when ever a process would attach or detach to a
process.

For myself the fact that it gave what my previous task_ref patch gave for free
with simpler code was a big win.  The problem is that if you hold a reference
to struct task_struct you lock in 10K of low memory.  If you do that in a user
controllable way like /proc does, with an unprivileged but hostile user space
application with typical resource limits of 1000 fds and 100 processes I can
trigger the OOM killer by consuming all of low memory with task structs, on a
machine wight 1GB of low memory.

If I instead hold a reference to struct pid which holds a pointer to my
task_struct, I don't suffer from that problem because struct pid is 2 orders
of magnitude smaller.  In fact struct pid is small enough that most other
kernel data structures dwarf it, so simply limiting the number of referring
data structures is enough to prevent exhaustion of low memory.

This splits the current struct pid into two structures, struct pid and struct
pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one.
struct pid_link is the per process linkage into the hash tables and lives in
struct task_struct.  struct pid is given an indepedent lifetime, and holds
pointers to each of the pid types.

The independent life of struct pid simplifies attach_pid, and detach_pid,
because we are always manipulating the list of pids and not the hash table.
In addition in giving struct pid an indpendent life it makes the concept much
more powerful.

Kernel data structures can now embed a struct pid * instead of a pid_t and
not suffer from pid wrap around problems or from keeping unnecessarily
large amounts of memory allocated.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:19:00 -08:00
Eric W. Biederman
8c7904a00b [PATCH] task: RCU protect task->usage
A big problem with rcu protected data structures that are also reference
counted is that you must jump through several hoops to increase the reference
count.  I think someone finally implemented atomic_inc_not_zero(&count) to
automate the common case.  Unfortunately this means you must special case the
rcu access case.

When data structures are only visible via rcu in a manner that is not
determined by the reference count on the object (i.e.  tasks are visible until
their zombies are reaped) there is a much simpler technique we can employ.
Simply delaying the decrement of the reference count until the rcu interval is
over.

What that means is that the proc code that looks up a task and later
wants to sleep can now do:

rcu_read_lock();
task = find_task_by_pid(some_pid);
if (task) {
	get_task_struct(task);
}
rcu_read_unlock();

The effect on the rest of the kernel is that put_task_struct becomes cheaper
and immediate, and in the case where the task has been reaped it frees the
task immediate instead of unnecessarily waiting an until the rcu interval is
over.

Cleanup of task_struct does not happen when its reference count drops to
zero, instead cleanup happens when release_task is called.  Tasks can only
be looked up via rcu before release_task is called.  All rcu protected
members of task_struct are freed by release_task.

Therefore we can move call_rcu from put_task_struct into release_task.  And
we can modify release_task to not immediately release the reference count
but instead have it call put_task_struct from the function it gives to
call_rcu.

The end result:

- get_task_struct is safe in an rcu context where we have just looked
  up the task.

- put_task_struct() simplifies into its old pre rcu self.

This reorganization also makes put_task_struct uncallable from modules as
it is not exported but it does not appear to be called from any modules so
this should not be an issue, and is trivially fixed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Andrew Morton
158d9ebd19 [PATCH] resurrect __put_task_struct
This just got nuked in mainline.  Bring it back because Eric's patches use it.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Eric W. Biederman
390e2ff077 [PATCH] Make setsid() more robust
The core problem: setsid fails if it is called by init.  The effect in 2.6.16
and the earlier kernels that have this problem is that if you do a "ps -j 1 or
ps -ej 1" you will see that init and several of it's children have process
group and session == 0.  Instead of process group == session == 1.  Despite
init calling setsid.

The reason it fails is that daemonize calls set_special_pids(1,1) on kernel
threads that are launched before /sbin/init is called.

The only remaining effect in that current->signal->leader == 0 for init
instead of 1.  And the setsid call fails.  No one has noticed because
/sbin/init does not check the return value of setsid.

In 2.4 where we don't have the pidhash table, and daemonize doesn't exist
setsid actually works for init.

I care a lot about pid == 1 not being a special case that we leave broken,
because of the container/jail work that I am doing.

- Carefully allow init (pid == 1) to call setsid despite the kernel using
  its session.

- Use find_task_by_pid instead of find_pid because find_pid taking a
  pidtype is going away.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Thomas Gleixner
9741ef964d [PATCH] futex: check and validate timevals
The futex timeval is not checked for correctness.  The change does not
break existing applications as the timeval is supplied by glibc (and glibc
always passes a correct value), but the glibc-internal tests for this
functionality fail.

Signed-off-by: Thomas Gleixner <tglx@tglx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Con Kolivas
d425b274ba [PATCH] sched: activate SCHED BATCH expired
To increase the strength of SCHED_BATCH as a scheduling hint we can
activate batch tasks on the expired array since by definition they are
latency insensitive tasks.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Con Kolivas
7c4bb1f9b3 [PATCH] sched: remove on runqueue requeueing
On runqueue time is used to elevate priority in schedule().

In the code it currently requeues tasks even if their priority is not
elevated, which would end up placing them at the end of their runqueue
array effectively delaying them instead of improving their priority.

Bug spotted by Mike Galbraith <efault@gmx.de>

This patch removes this requeueing.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Con Kolivas
5138930e6a [PATCH] sched: include noninteractive sleep in idle detect
Tasks waiting in SLEEP_NONINTERACTIVE state can now get to best priority so
they need to be included in the idle detection code.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Con Kolivas
e72ff0bb2c [PATCH] sched: dont decrease idle sleep avg
We watch for tasks that sleep extended periods and don't allow one single
prolonged sleep period from elevating priority to maximum bonus to prevent cpu
bound tasks from getting high priority with single long sleeps.  There is a
bug in the current code that also penalises tasks that already have high
priority.  Correct that bug.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Con Kolivas
e7c38cb49c [PATCH] sched: make task_noninteractive use sleep_type
Alterations to the pipe code in the kernel made it possible for relative
starvation to occur with tasks that slept waiting on a pipe getting unfair
priority bonuses even if they were otherwise fully cpu bound so the
TASK_NONINTERACTIVE flag was introduced which prevented any change to
sleep_avg while sleeping waiting on a pipe.  This change also leads to the
converse though, preventing any priority boost from occurring in truly
interactive tasks that wait on pipes.

Convert the TASK_NONINTERACTIVE flag to set sleep_type to SLEEP_NONINTERACTIVE
which will allow a linear bonus to priority based on sleep time thus allowing
interactive tasks to get high priority if they sleep enough.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Con Kolivas
3dee386e14 [PATCH] sched: cleanup task_activated()
The activated flag in task_struct is used to track different sleep types and
its usage is somewhat obfuscated.  Convert the variable to an enum with more
descriptive names without altering the function.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Jack Steiner
db1b1fefc2 [PATCH] sched: reduce overhead of calc_load
Currently, count_active_tasks() calls both nr_running() &
nr_interruptible().  Each of these functions does a "for_each_cpu" & reads
values from the runqueue of each cpu.  Although this is not a lot of
instructions, each runqueue may be located on different node.  Depending on
the architecture, a unique TLB entry may be required to access each
runqueue.

Since there may be more runqueues than cpu TLB entries, a scan of all
runqueues can trash the TLB.  Each memory reference incurs a TLB miss &
refill.

In addition, the runqueue cacheline that contains nr_running &
nr_uninterruptible may be evicted from the cache between the two passes.
This causes unnecessary cache misses.

Combining nr_running() & nr_interruptible() into a single function
substantially reduces the TLB & cache misses on large systems.  This should
have no measureable effect on smaller systems.

On a 128p IA64 system running a memory stress workload, the new function
reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.

Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Dimitri Sivanich
3055addadb [PATCH] hrtimer: call get_softirq_time() only when necessary in run_hrtimer_queue()
It seems that run_hrtimer_queue() is calling get_softirq_time() more
often than it needs to.

With this patch, it only calls get_softirq_time() if there's a
pending timer.

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Thomas Gleixner
669d7868ae [PATCH] hrtimer: use generic sleeper for nanosleep
Replace the nanosleep private sleeper functionality by the generic hrtimer
sleeper.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Thomas Gleixner
00362e33f6 [PATCH] hrtimer: create generic sleeper
The removal of the data field in the hrtimer structure enforces the
embedding of the timer into another data structure.  nanosleep now uses a
private implementation of the most common used timer callback function
(simple task wakeup).

In order to avoid the reimplentation of such functionality all over the
place a generic hrtimer_sleeper functionality is created.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Andrew Morton
7529c30116 [PATCH] modules: permit Dual-MIT/GPL licenses
One of the LEDs driver files wants to use this.

Probably drivers/mtd/maps/ipaq-flash.c wants to convert as well - right now
it'll be tainting the kernel.

Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Bowler <jbowler@acm.org>
Cc: "'Richard Purdie'" <rpurdie@rpsys.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:56 -08:00
Paul Jackson
e4e364e865 [PATCH] cpuset: memory migration interaction fix
Fix memory migration so that it works regardless of what cpuset the invoking
task is in.

If a task invoked a memory migration, by doing one of:

       1) writing a different nodemask to a cpuset 'mems' file, or

       2) writing a tasks pid to a different cpuset's 'tasks' file,
          where the cpuset had its 'memory_migrate' option turned on, then the
          allocation of the new pages for the migrated task(s) was constrained
          by the invoking tasks cpuset.

If this task wasn't in a cpuset that allowed the requested memory nodes, the
memory migration would happen to some other nodes that were in that invoking
tasks cpuset.  This was usually surprising and puzzling behaviour: Why didn't
the pages move?  Why did the pages move -there-?

To fix this, temporarilly change the invoking tasks 'mems_allowed' task_struct
field to the nodes the migrating tasks is moving to, so that new pages can be
allocated there.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:55 -08:00
Paul Jackson
2741a559a0 [PATCH] cpuset: unsafe mm reference fix
Fix unsafe reference to a tasks mm struct, by moving the reference inside of a
convenient nearby properly guarded code block.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:55 -08:00
Paul Jackson
4a01c8d5be [PATCH] cpuset: task_lock comment fix
Fix cpuset comment involving case of a tasks cpuset pointer being NULL.
Thanks to "the_top_cpuset_hack", this code no longer sees NULL task->cpuset
pointers.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:55 -08:00
KaiGai Kohei
bb231fe3a5 [PATCH] Fix pacct bug in multithreading case.
I noticed a bug on the process accounting facility.  In multi-threading
process, some data would be recorded incorrectly when the group_leader dies
earlier than one or more threads.  The attached patch fixes this problem.

See below.  'bugacct' is a test program that create a worker thread after 4
seconds sleeping, then the group_leader dies soon.  The worker thread
consume CPU/Memory for 6 seconds, then exit.  We can estimate 10 seconds as
etime and 6 seconds as stime + utime.  This is a sample program which the
group_leader dies earlier than other threads.

The results of same binary execution on different kernel are below.
-- accounted records --------------------
         |   btime  | utime | stime | etime | minflt | majflt |   comm  |
original | 13:16:40 |  0.00 |  0.00 |  6.10 |    171 |      0 | bugacct |
 patched | 13:20:21 |  5.83 |  0.18 | 10.03 |  32776 |      0 | bugacct |
(*) bugacct allocates 128MB memory, thus 128MB / 4KB = 32768 of minflt is
    appropriate.

-- Test results in original kernel ------
$ date; time -p ./bugacct
Tue Mar 28 13:16:36 JST 2006  <- But pacct said btime is 13:16:40
real 10.11                    <- But pacct said etime is 6.10
user 5.96                     <- But pacct said utime is 0.00
sys 0.14                      <- But pacct said stime is 0.00
$
-- Test results in patched kernel -------
$ date; time -p ./bugacct
Tue Mar 28 13:20:21 JST 2006
real 10.04
user 5.83
sys 0.19
$

In the original 2.6.16 kernel, pacct records btime, utime, stime, etime and
minflt incorrectly.  In my opinion, this problem is caused by an assumption
that group_leader dies last.

The following section calculates process running time for etime and btime.
But it means running time of the thread that dies last, not process.  The
start_time of the first thread in the process (group_leader) should be
reduced from uptime to calculate etime and btime correctly.

   ---- do_acct_process() in kernel/acct.c:
   /* calculate run_time in nsec*/
   do_posix_clock_monotonic_gettime(&uptime);
   run_time = (u64)uptime.tv_sec*NSEC_PER_SEC + uptime.tv_nsec;
   run_time -= (u64)current->start_time.tv_sec*NSEC_PER_SEC
                                   + current->start_time.tv_nsec;
   ----

The following section calculates stime and utime of the process.
But it might count the utime and stime of the group_leader duplicatly
and ignore the utime and stime of the thread dies last, when one or
more threads remain after group_leader dead.
The ac_utime should be calculated as the sum of the signal->utime
and utime of the thread dies last. The ac_stime should be done also.

   ---- do_acct_process() in kernel/acct.c:
   jiffies = cputime_to_jiffies(cputime_add(current->group_leader->utime,
                                            current->signal->utime));
   ac.ac_utime = encode_comp_t(jiffies_to_AHZ(jiffies));
   jiffies = cputime_to_jiffies(cputime_add(current->group_leader->stime,
                                            current->signal->stime));
   ac.ac_stime = encode_comp_t(jiffies_to_AHZ(jiffies));
   ----

The part of the minflt/majflt calculation has same problem.
This patch solves those problems, I think.

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:54 -08:00
OGAWA Hirofumi
9b41046cd0 [PATCH] Don't pass boot parameters to argv_init[]
The boot cmdline is parsed in parse_early_param() and
parse_args(,unknown_bootoption).

And __setup() is used in obsolete_checksetup().

	start_kernel()
		-> parse_args()
			-> unknown_bootoption()
				-> obsolete_checksetup()

If __setup()'s callback (->setup_func()) returns 1 in
obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
handled.

If ->setup_func() returns 0, obsolete_checksetup() tries other
->setup_func().  If all ->setup_func() that matched a parameter returns 0,
a parameter is seted to argv_init[].

Then, when runing /sbin/init or init=app, argv_init[] is passed to the app.
If the app doesn't ignore those arguments, it will warning and exit.

This patch fixes a wrong usage of it, however fixes obvious one only.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:53 -08:00
Oleg Nesterov
a2c348fe01 [PATCH] __mod_timer: simplify ->base changing
Since base and new_base are of the same type now, we can save one 'if'
branch and simplify the code a bit.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:53 -08:00
Oleg Nesterov
3691c5199e [PATCH] kill __init_timer_base in favor of boot_tvec_bases
Commit a4a6198b80:
	[PATCH] tvec_bases too large for per-cpu data

introduced "struct tvec_t_base_s boot_tvec_bases" which is visible at
compile time.  This means we can kill __init_timer_base and move
timer_base_s's content into tvec_t_base_s.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:52 -08:00
Pavel Machek
85b6bce365 [PATCH] Fix suspend with traced tasks
strace /bin/bash misbehaves after resume; this fixes it.

(akpm: it's scary calling refrigerator() in state TASK_TRACED, but it seems to
do the right thing).

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:50 -08:00
Oleg Nesterov
547679087b [PATCH] send_sigqueue: simplify and fix the race
send_sigqueue() checks PF_EXITING, then locks p->sighand->siglock.  This is
unsafe: 'p' can exit in between and set ->sighand = NULL.  The race is
theoretical, the window is tiny and irqs are disabled by the caller, so I
don't think we need the fix for -stable tree.

Convert send_sigqueue() to use lock_task_sighand() helper.

Also, delete 'p->flags & PF_EXITING' re-check, it is unneeded and the
comment is wrong.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
a1d5e21e3e [PATCH] do_notify_parent_cldstop: remove 'to_self' param
The previous patch has changed callsites of do_notify_parent_cldstop() so that
to_self == (->ptrace & PT_PTRACED) always (as it should be).  We can remove
this parameter now.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
883606a7c9 [PATCH] finish_stop: don't check stop_count < 0
Remove an obscure 'stop_count < 0' check in finish_stop().  The previous patch
made this case impossible.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
dac27f4a09 [PATCH] simplify do_signal_stop()
do_signal_stop() considers 'thread_group_empty()' as a special case.
This was needed to avoid taking tasklist_lock. Since this lock is
unneeded any longer, we can remove this special case and simplify
the code even more.

Also, before this patch, finish_stop() was called with stop_count == -1
for 'thread_group_empty()' case. This is not strictly wrong, but confusing
and unneeded.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
a7e5328a06 [PATCH] cleanup __exit_signal->cleanup_sighand path
Move 'tsk->sighand = NULL' from cleanup_sighand() to __exit_signal().  This
makes the exit path more understandable and allows us to do
cleanup_sighand() outside of ->siglock protected section.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
4a2c7a7837 [PATCH] make fork() atomic wrt pgrp/session signals
Eric W. Biederman wrote:
>
> Ok. SUSV3/Posix is clear, fork is atomic with respect
> to signals.  Either a signal comes before or after a
> fork but not during. (See the rationale section).
> http://www.opengroup.org/onlinepubs/000095399/functions/fork.html
>
> The tasklist_lock does not stop forks from adding to a process
> group. The forks stall while the tasklist_lock is held, but a fork
> that began before we grabbed the tasklist_lock simply completes
> afterwards, and the child does not receive the signal.

This also means that SIGSTOP or sig_kernel_coredump() signal can't
be delivered to pgrp/session reliably.

With this patch copy_process() returns -ERESTARTNOINTR when it
detects a pending signal, fork() will be restarted transparently
after handling the signals.

This patch also deletes now unneeded "group_stop_count > 0" check,
copy_process() can no longer succeed while group stop in progress.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-By: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
47e65328a7 [PATCH] pids: kill PIDTYPE_TGID
This patch kills PIDTYPE_TGID pid_type thus saving one hash table in
kernel/pid.c and speeding up subthreads create/destroy a bit.  It is also a
preparation for the further tref/pids rework.

This patch adds 'struct list_head thread_group' to 'struct task_struct'
instead.

We don't detach group leader from PIDTYPE_PID namespace until another
thread inherits it's ->pid == ->tgid, so we are safe wrt premature
free_pidmap(->tgid) call.

Currently there are no users of find_task_by_pid_type(PIDTYPE_TGID).
Should the need arise, we can use find_task_by_pid()->group_leader.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-By: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
88531f725b [PATCH] do_sigaction: don't take tasklist_lock
do_sigaction() does not need tasklist_lock anymore, we can simplify the code.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
aacc90944d [PATCH] do_group_exit: don't take tasklist_lock
do_group_exit() takes tasklist_lock for zap_other_threads(), this is unneeded
now.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
a122b341b7 [PATCH] do_signal_stop: don't take tasklist_lock
do_signal_stop() does not need tasklist_lock anymore.  So it does not need to
do misc re-checks, and we can simplify the code.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
6108ccd3e2 [PATCH] relax sig_needs_tasklist()
handle_stop_signal() does not need tasklist_lock for SIG_KERNEL_STOP_MASK
signals anymore.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
7d7185c818 [PATCH] sys_times: don't take tasklist_lock
sys_times: don't take tasklist_lock

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
5876700cd3 [PATCH] do __unhash_process() under ->siglock
This patch moves __unhash_process() call from realease_task() to
__exit_signal(), so __detach_pid() is called with ->siglock held.

This means we don't need tasklist_lock to iterate over thread group anymore:

	copy_process() was already changed to do attach_pid()
	under ->siglock.

	Eric's "pidhash-kill-switch_exec_pids.patch" from -mm
	changed de_thread() so it doesn't touch PIDTYPE_TGID.

NOTE: de_thread() still needs some attention.  It still changes task->pid
lockless.  Taking ->sighand.siglock here allows to do more tasklist_lock
removals.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
35f5cad8c4 [PATCH] revert "Optimize sys_times for a single thread process"
This patch reverts 'CONFIG_SMP && thread_group_empty()' optimization in
sys_times().  The reason is that the next patch breaks memory ordering which
is needed for that optimization.

tasklist_lock in sys_times() will be eliminated completely by further patch.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
6a14c5c9da [PATCH] move __exit_signal() to kernel/exit.c
__exit_signal() is private to release_task() now.  I think it is better to
make it static in kernel/exit.c and export flush_sigqueue() instead - this
function is much more simple and straightforward.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
c81addc9d3 [PATCH] rename __exit_sighand to cleanup_sighand
Cosmetic, rename __exit_sighand to cleanup_sighand and move it close to
copy_sighand().

This matches copy_signal/cleanup_signal naming, and I think it is easier to
follow.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
29ff471234 [PATCH] cleanup __exit_signal()
This patch factors out duplicated code under 'if' branches.  Also, BUG_ON()
conversions and whitespace cleanups.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
6b3934ef52 [PATCH] copy_process: cleanup bad_fork_cleanup_signal
__exit_signal() does important cleanups atomically under ->siglock.  It is
also called from copy_process's error path.  This is not good, for example we
can't move __unhash_process() under ->siglock for that reason.

We should not mix these 2 paths, just look at ugly 'if (p->sighand)' under
'bad_fork_cleanup_sighand:' label.  For copy_process() case it is sufficient
to just backout copy_signal(), nothing more.

Again, nobody can see this task yet.  For CLONE_THREAD case we just decrement
signal->count, otherwise nobody can see this ->signal and we can free it
lockless.

This patch assumes it is safe to do exit_thread_group_keys() without
tasklist_lock.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:42 -08:00
Oleg Nesterov
7001510d0c [PATCH] copy_process: cleanup bad_fork_cleanup_sighand
The only caller of exit_sighand(tsk) is copy_process's error path.  We can
call __exit_sighand() directly and kill exit_sighand().

This 'tsk' was not yet registered in pid_hash[] or init_task.tasks, it has no
external references, nobody can see it, and

	IF (clone_flags & CLONE_SIGHAND)
		At least 'current' has a reference to ->sighand, this
		means atomic_dec_and_test(sighand->count) can't be true.

	ELSE
		Nobody can see this ->sighand, this means we can free it
		without any locking.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:42 -08:00
Oleg Nesterov
a9e88e84b5 [PATCH] introduce sig_needs_tasklist() helper
In my opinion this patch cleans up the code.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:42 -08:00
Oleg Nesterov
f63ee72e0f [PATCH] introduce lock_task_sighand() helper
Add lock_task_sighand() helper and converts group_send_sig_info() to use
it.  Hopefully we will have more users soon.

This patch also removes '!sighand->count' and '!p->usage' checks, I think
they both are bogus, racy and unneeded (but probably it makes sense to
restore them as BUG_ON()s).

->sighand is cleared and it's ->count is decremented in release_task() with
sighand->siglock held, so it is a bug to have '!p->usage || !->count' after
we already locked and verified it is the same.  On the other hand, an
already dead task without ->sighand can have a non-zero ->usage due to
ptrace, for example.

If we read the stale value of ->sighand we must see the change after
spin_lock(), because that change was done while holding that same old
->sighand.siglock.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:42 -08:00
Oleg Nesterov
aa1757f90b [PATCH] convert sighand_cache to use SLAB_DESTROY_BY_RCU
This patch borrows a clever Hugh's 'struct anon_vma' trick.

Without tasklist_lock held we can't trust task->sighand until we locked it
and re-checked that it is still the same.

But this means we don't need to defer 'kmem_cache_free(sighand)'.  We can
return the memory to slab immediately, all we need is to be sure that
sighand->siglock can't dissapear inside rcu protected section.

To do so we need to initialize ->siglock inside ctor function,
SLAB_DESTROY_BY_RCU does the rest.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:42 -08:00
Oleg Nesterov
1f09f9749c [PATCH] release_task: replace open-coded ptrace_unlink()
Use ptrace_unlink() instead of open-coding.  No changes in kernel/exit.o

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
8292d633ad [PATCH] wait_for_helper: trivial style cleanup
Use NULL instead of (... *)0

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
6ac781b11a [PATCH] reparent_thread: use remove_parent/add_parent
Use remove_parent/add_parent instead of open coding.

No changes in kernel/exit.o

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
73b9ebfe12 [PATCH] pidhash: don't count idle threads
fork_idle() does unhash_process() just after copy_process().  Contrary,
boot_cpu's idle thread explicitely registers itself for each pid_type with nr
= 0.

copy_process() already checks p->pid != 0 before process_counts++, I think we
can just skip attach_pid() calls and job control inits for idle threads and
kill unhash_process().  We don't need to cleanup ->proc_dentry in fork_idle()
because with this patch idle threads are never hashed in
kernel/pid.c:pid_hash[].

We don't need to hash pid == 0 in pidmap_init().  free_pidmap() is never
called with pid == 0 arg, so it will never be reused.  So it is still possible
to use pid == 0 in any PIDTYPE_xxx namespace from kernel/pid.c's POV.

However with this patch we don't hash pid == 0 for PIDTYPE_PID case.  We still
have have PIDTYPE_PGID/PIDTYPE_SID entries with pid == 0: /sbin/init and
kernel threads which don't call daemonize().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
c97d98931a [PATCH] kill SET_LINKS/REMOVE_LINKS
Both SET_LINKS() and SET_LINKS/REMOVE_LINKS() have exactly one caller, and
these callers already check thread_group_leader().

This patch kills theese macros, they mix two different things: setting
process's parent and registering it in init_task.tasks list.  Callers are
updated to do these actions by hand.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
9b678ece42 [PATCH] don't use REMOVE_LINKS/SET_LINKS for reparenting
There are places where kernel uses REMOVE_LINKS/SET_LINKS while changing
process's ->parent.  Use add_parent/remove_parent instead, they don't abuse
of global process list.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
8fafabd86f [PATCH] remove add_parent()'s parent argument
add_parent(p, parent) is always called with parent == p->parent, and it makes
no sense to do it differently.  This patch removes this argument.

No changes in affected .o files.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
d799f03597 [PATCH] choose_new_parent: remove unused arg, sanitize exit_state check
'child_reaper' arg is not used in choose_new_parent().

"->exit_state >= EXIT_ZOMBIE" check is a leftover, was
valid when EXIT_ZOMBIE lived in ->state var.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:40 -08:00
Eric W. Biederman
d73d65293e [PATCH] pidhash: kill switch_exec_pids
switch_exec_pids is only called from de_thread by way of exec, and it is
only called when we are exec'ing from a non thread group leader.

Currently switch_exec_pids gives the leader the pid of the thread and
unhashes and rehashes all of the process groups.  The leader is already in
the EXIT_DEAD state so no one cares about it's pids.  The only concern for
the leader is that __unhash_process called from release_task will function
correctly.  If we don't touch the leader at all we know that
__unhash_process will work fine so there is no need to touch the leader.

For the task becomming the thread group leader, we just need to give it the
pid of the old thread group leader, add it to the task list, and attach it
to the session and the process group of the thread group.

Currently de_thread is also adding the task to the task list which is just
silly.

Currently the only leader of __detach_pid besides detach_pid is
switch_exec_pids because of the ugly extra work that was being
performed.

So this patch removes switch_exec_pids because it is doing too much, it is
creating an unnecessary special case in pid.c, duing work duplicated in
de_thread, and generally obscuring what it is going on.

The necessary work is added to de_thread, and it seems to be a little
clearer there what is going on.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:40 -08:00
Eric W. Biederman
fef23e7fbb [PATCH] exec: allow init to exec from any thread.
After looking at the problem of init calling exec some more I figured out
an easy way to make the code work.

The actual symptom without out this patch is that all threads will die
except pid == 1, and the thread calling exec.  The thread calling exec will
wait forever for pid == 1 to die.

Since pid == 1 does not install a handler for SIGKILL it will never die.

This modifies the tests for init from current->pid == 1 to the equivalent
current == child_reaper.  And then it causes exec in the ugly case to
modify child_reaper.

The only weird symptom is that you wind up with an init process that
doesn't have the oldest start time on the box.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:40 -08:00
Andrew Morton
ec7e15d648 [PATCH] compat_sys_futex() warning fix
kernel/futex_compat.c: In function `compat_sys_futex':
kernel/futex_compat.c:140: warning: passing arg 1 of `do_futex' makes integer from pointer without a cast
kernel/futex_compat.c:140: warning: passing arg 5 of `do_futex' makes integer from pointer without a cast

Not sure what Ingo was thinking of here.  Put the casts back in.

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:09 -08:00
KAMEZAWA Hiroyuki
0a94502277 [PATCH] for_each_possible_cpu: fixes for generic part
replaces for_each_cpu with for_each_possible_cpu().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:05 -08:00
Eric Sesterhenn
b9e20a9200 [PATCH] Change dash2underscore() return value to char
Since dash2underscore() just operates and returns chars, I guess its safe
to change the return value to a char.  With my .config, this reduces its
size by 5 bytes.

   text    data     bss     dec     hex filename
   4155     152       0    4307    10d3 params.o.orig
   4150     152       0    4302    10ce params.o

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:03 -08:00
Andrew Morton
f83ca9fe3e [PATCH] symversion warning fix
gcc-4.2:

kernel/module.c: In function '__find_symbol':
kernel/module.c:158: warning: the address of '__start___kcrctab', will always evaluate as 'true'
kernel/module.c:165: warning: the address of '__start___kcrctab_gpl', will always evaluate as 'true'
kernel/module.c:182: warning: the address of '__start___kcrctab_gpl_future', will always evaluate as 'true'

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:02 -08:00
Alan Stern
e041c68341 [PATCH] Notifier chain update: API changes
The kernel's implementation of notifier chains is unsafe.  There is no
protection against entries being added to or removed from a chain while the
chain is in use.  The issues were discussed in this thread:

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2

We noticed that notifier chains in the kernel fall into two basic usage
classes:

	"Blocking" chains are always called from a process context
	and the callout routines are allowed to sleep;

	"Atomic" chains can be called from an atomic context and
	the callout routines are not allowed to sleep.

We decided to codify this distinction and make it part of the API.  Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name).  New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain.  The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.

With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed.  For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections.  (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)

There are some limitations, which should not be too hard to live with.  For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem.  Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain.  (This did happen in a couple of places and the code
had to be changed to avoid it.)

Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization.  Instead we use RCU.  The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent that calling a chain.

Here is the list of chains that we adjusted and their classifications.  None
of them use the raw API, so for the moment it is only a placeholder.

  ATOMIC CHAINS
  -------------
arch/i386/kernel/traps.c:		i386die_chain
arch/ia64/kernel/traps.c:		ia64die_chain
arch/powerpc/kernel/traps.c:		powerpc_die_chain
arch/sparc64/kernel/traps.c:		sparc64die_chain
arch/x86_64/kernel/traps.c:		die_chain
drivers/char/ipmi/ipmi_si_intf.c:	xaction_notifier_list
kernel/panic.c:				panic_notifier_list
kernel/profile.c:			task_free_notifier
net/bluetooth/hci_core.c:		hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c:	ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c:	ip_conntrack_expect_chain
net/ipv6/addrconf.c:			inet6addr_chain
net/netfilter/nf_conntrack_core.c:	nf_conntrack_chain
net/netfilter/nf_conntrack_core.c:	nf_conntrack_expect_chain
net/netlink/af_netlink.c:		netlink_chain

  BLOCKING CHAINS
  ---------------
arch/powerpc/platforms/pseries/reconfig.c:	pSeries_reconfig_chain
arch/s390/kernel/process.c:		idle_chain
arch/x86_64/kernel/process.c		idle_notifier
drivers/base/memory.c:			memory_chain
drivers/cpufreq/cpufreq.c		cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c		cpufreq_transition_notifier_list
drivers/macintosh/adb.c:		adb_client_list
drivers/macintosh/via-pmu.c		sleep_notifier_list
drivers/macintosh/via-pmu68k.c		sleep_notifier_list
drivers/macintosh/windfarm_core.c	wf_client_list
drivers/usb/core/notify.c		usb_notifier_list
drivers/video/fbmem.c			fb_notifier_list
kernel/cpu.c				cpu_chain
kernel/module.c				module_notify_list
kernel/profile.c			munmap_notifier
kernel/profile.c			task_exit_notifier
kernel/sys.c				reboot_notifier_list
net/core/dev.c				netdev_chain
net/decnet/dn_dev.c:			dnaddr_chain
net/ipv4/devinet.c:			inetaddr_chain

It's possible that some of these classifications are wrong.  If they are,
please let us know or submit a patch to fix them.  Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)

The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.

[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:50 -08:00
Ingo Molnar
8f17d3a504 [PATCH] lightweight robust futexes updates
- fix: initialize the robust list(s) to NULL in copy_process.

- doc update

- cleanup: rename _inuser to _inatomic

- __user cleanups and other small cleanups

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:49 -08:00
Ingo Molnar
34f192c652 [PATCH] lightweight robust futexes: compat
32-bit syscall compatibility support.  (This patch also moves all futex
related compat functionality into kernel/futex_compat.c.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:49 -08:00
Ingo Molnar
0771dfefc9 [PATCH] lightweight robust futexes: core
Add the core infrastructure for robust futexes: structure definitions, the new
syscalls and the do_exit() based cleanup mechanism.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:49 -08:00
Siddha, Suresh B
0806903316 [PATCH] sched: fix group power for allnodes_domains
Current sched groups power calculation for allnodes_domains is wrong.  We
should really be using cumulative power of the physical packages in that
group (similar to the calculation in node_domains)

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:44 -08:00
Siddha, Suresh B
1e9f28fa1e [PATCH] sched: new sched domain for representing multi-core
Add a new sched domain for representing multi-core with shared caches
between cores.  Consider a dual package system, each package containing two
cores and with last level cache shared between cores with in a package.  If
there are two runnable processes, with this appended patch those two
processes will be scheduled on different packages.

On such systems, with this patch we have observed 8% perf improvement with
specJBB(2 warehouse) benchmark and 35% improvement with CFP2000 rate(with 2
users).

This new domain will come into play only on multi-core systems with shared
caches.  On other systems, this sched domain will be removed by domain
degeneration code.  This new domain can be also used for implementing power
savings policy (see OLS 2005 CMP kernel scheduler paper for more details..
I will post another patch for power savings policy soon)

Most of the arch/* file changes are for cpu_coregroup_map() implementation.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:43 -08:00
Andreas Mohr
77e4bfbcf0 [PATCH] Small schedule() optimization
small schedule() microoptimization.

Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:43 -08:00
Martin Andersson
013d386814 [PATCH] sched: fix task interactivity calculation
Is a truncation error in kernel/sched.c triggered when the nice value is
negative.  The affected code is used in the TASK_INTERACTIVE macro.

The code is:
#define SCALE(v1,v1_max,v2_max) \
	(v1) * (v2_max) / (v1_max)

which is used in this way:
SCALE(TASK_NICE(p), 40, MAX_BONUS)

Comments in the code says:
  * This part scales the interactivity limit depending on niceness.
  *
  * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
  * Here are a few examples of different nice levels:
  *
  *  TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
  *  TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
  *  TASK_INTERACTIVE(  0): [1,1,1,1,0,0,0,0,0,0,0]
  *  TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
  *  TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
  *
  * (the X axis represents the possible -5 ... 0 ... +5 dynamic
  *  priority range a task can explore, a value of '1' means the
  *  task is rated interactive.)

However, the current code does not scale it linearly and the result differs
from the given examples.  If the mathematical function "floor" is used when
the nice value is negative instead of the truncation one gets when using
integer division, the result conforms to the documentation.

Output of TASK_INTERACTIVE when using the kernel code:
nice    dynamic priorities
-20     1     1     1     1     1     1     1     1     1     0     0
-19     1     1     1     1     1     1     1     1     0     0     0
-18     1     1     1     1     1     1     1     1     0     0     0
-17     1     1     1     1     1     1     1     1     0     0     0
-16     1     1     1     1     1     1     1     1     0     0     0
-15     1     1     1     1     1     1     1     0     0     0     0
-14     1     1     1     1     1     1     1     0     0     0     0
-13     1     1     1     1     1     1     1     0     0     0     0
-12     1     1     1     1     1     1     1     0     0     0     0
-11     1     1     1     1     1     1     0     0     0     0     0
-10     1     1     1     1     1     1     0     0     0     0     0
  -9     1     1     1     1     1     1     0     0     0     0     0
  -8     1     1     1     1     1     1     0     0     0     0     0
  -7     1     1     1     1     1     0     0     0     0     0     0
  -6     1     1     1     1     1     0     0     0     0     0     0
  -5     1     1     1     1     1     0     0     0     0     0     0
  -4     1     1     1     1     1     0     0     0     0     0     0
  -3     1     1     1     1     0     0     0     0     0     0     0
  -2     1     1     1     1     0     0     0     0     0     0     0
  -1     1     1     1     1     0     0     0     0     0     0     0
  0      1     1     1     1     0     0     0     0     0     0     0
  1      1     1     1     1     0     0     0     0     0     0     0
  2      1     1     1     1     0     0     0     0     0     0     0
  3      1     1     1     1     0     0     0     0     0     0     0
  4      1     1     1     0     0     0     0     0     0     0     0
  5      1     1     1     0     0     0     0     0     0     0     0
  6      1     1     1     0     0     0     0     0     0     0     0
  7      1     1     1     0     0     0     0     0     0     0     0
  8      1     1     0     0     0     0     0     0     0     0     0
  9      1     1     0     0     0     0     0     0     0     0     0
10      1     1     0     0     0     0     0     0     0     0     0
11      1     1     0     0     0     0     0     0     0     0     0
12      1     0     0     0     0     0     0     0     0     0     0
13      1     0     0     0     0     0     0     0     0     0     0
14      1     0     0     0     0     0     0     0     0     0     0
15      1     0     0     0     0     0     0     0     0     0     0
16      0     0     0     0     0     0     0     0     0     0     0
17      0     0     0     0     0     0     0     0     0     0     0
18      0     0     0     0     0     0     0     0     0     0     0
19      0     0     0     0     0     0     0     0     0     0     0

Output of TASK_INTERACTIVE when using "floor"
nice    dynamic priorities
-20     1     1     1     1     1     1     1     1     1     0     0
-19     1     1     1     1     1     1     1     1     1     0     0
-18     1     1     1     1     1     1     1     1     1     0     0
-17     1     1     1     1     1     1     1     1     1     0     0
-16     1     1     1     1     1     1     1     1     0     0     0
-15     1     1     1     1     1     1     1     1     0     0     0
-14     1     1     1     1     1     1     1     1     0     0     0
-13     1     1     1     1     1     1     1     1     0     0     0
-12     1     1     1     1     1     1     1     0     0     0     0
-11     1     1     1     1     1     1     1     0     0     0     0
-10     1     1     1     1     1     1     1     0     0     0     0
  -9     1     1     1     1     1     1     1     0     0     0     0
  -8     1     1     1     1     1     1     0     0     0     0     0
  -7     1     1     1     1     1     1     0     0     0     0     0
  -6     1     1     1     1     1     1     0     0     0     0     0
  -5     1     1     1     1     1     1     0     0     0     0     0
  -4     1     1     1     1     1     0     0     0     0     0     0
  -3     1     1     1     1     1     0     0     0     0     0     0
  -2     1     1     1     1     1     0     0     0     0     0     0
  -1     1     1     1     1     1     0     0     0     0     0     0
   0     1     1     1     1     0     0     0     0     0     0     0
   1     1     1     1     1     0     0     0     0     0     0     0
   2     1     1     1     1     0     0     0     0     0     0     0
   3     1     1     1     1     0     0     0     0     0     0     0
   4     1     1     1     0     0     0     0     0     0     0     0
   5     1     1     1     0     0     0     0     0     0     0     0
   6     1     1     1     0     0     0     0     0     0     0     0
   7     1     1     1     0     0     0     0     0     0     0     0
   8     1     1     0     0     0     0     0     0     0     0     0
   9     1     1     0     0     0     0     0     0     0     0     0
  10     1     1     0     0     0     0     0     0     0     0     0
  11     1     1     0     0     0     0     0     0     0     0     0
  12     1     0     0     0     0     0     0     0     0     0     0
  13     1     0     0     0     0     0     0     0     0     0     0
  14     1     0     0     0     0     0     0     0     0     0     0
  15     1     0     0     0     0     0     0     0     0     0     0
  16     0     0     0     0     0     0     0     0     0     0     0
  17     0     0     0     0     0     0     0     0     0     0     0
  18     0     0     0     0     0     0     0     0     0     0     0
  19     0     0     0     0     0     0     0     0     0     0     0

Signed-off-by: Martin Andersson <martin.andersson@control.lth.se>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:43 -08:00
Linus Torvalds
9ae21d1bb3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
  drivers/char/ftape/lowlevel/fdc-io.c: Correct a comment
  Kconfig help: MTD_JEDECPROBE already supports Intel
  Remove ugly debugging stuff
  do_mounts.c: Minor ROOT_DEV comment cleanup
  BUG_ON() Conversion in drivers/s390/block/dasd_devmap.c
  BUG_ON() Conversion in mm/mempool.c
  BUG_ON() Conversion in mm/memory.c
  BUG_ON() Conversion in kernel/fork.c
  BUG_ON() Conversion in ipc/sem.c
  BUG_ON() Conversion in fs/ext2/
  BUG_ON() Conversion in fs/hfs/
  BUG_ON() Conversion in fs/dcache.c
  BUG_ON() Conversion in fs/buffer.c
  BUG_ON() Conversion in input/serio/hp_sdc_mlc.c
  BUG_ON() Conversion in md/dm-table.c
  BUG_ON() Conversion in md/dm-path-selector.c
  BUG_ON() Conversion in drivers/isdn
  BUG_ON() Conversion in drivers/char
  BUG_ON() Conversion in drivers/mtd/
2006-03-26 09:41:18 -08:00
bibo mao
c6fd91f0bd [PATCH] kretprobe instance recycled by parent process
When kretprobe probes the schedule() function, if the probed process exits
then schedule() will never return, so some kretprobe instances will never
be recycled.

In this patch the parent process will recycle retprobe instances of the
probed function and there will be no memory leak of kretprobe instances.

Signed-off-by: bibo mao <bibo.mao@intel.com>
Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:04 -08:00
Roman Zippel
05cfb614dd [PATCH] hrtimers: remove data field
The nanosleep cleanup allows to remove the data field of hrtimer.  The
callback function can use container_of() to get it's own data.  Since the
hrtimer structure is anyway embedded in other structures, this adds no
overhead.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:03 -08:00
Roman Zippel
df869b630d [PATCH] hrtimers: remove nsec_t typedef
nsec_t predates ktime_t and has mostly been superseded by it.  In the few
places that are left it's better to make it explicit that we're dealing with
64 bit values here.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:03 -08:00
Roman Zippel
b75f7a51ca [PATCH] hrtimers: remove state field
Remove the state field and encode this information in the rb_node similiar to
normal timer.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Roman Zippel
432569bb9d [PATCH] hrtimers: simplify nanosleep
nanosleep is the only user of the expired state, so let it manage this itself,
which makes the hrtimer code a bit simpler.  The remaining time is also only
calculated if requested.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Roman Zippel
3b98a53281 [PATCH] hrtimers: posix-timer: cleanup common_timer_get()
Cleanup common_timer_get() a little.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Roman Zippel
44f2147551 [PATCH] hrtimers: pass current time to hrtimer_forward()
Pass current time to hrtimer_forward().  This allows to use the softirq time
in the timer base when the forward function is called from the timer callback.
 Other places pass current time with a call to timer->base->get_time().

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Thomas Gleixner
92127c7a45 [PATCH] hrtimers: optimize softirq runqueues
The hrtimer softirq is called from the timer softirq every tick.  Retrieve the
current time from xtime and wall_to_monotonic instead of calling
base->get_time() for each timer base.  Store the time in the base structure
and provide a hook once clock source abstractions are in place and to keep the
code open for new base clocks.

Based on a patch from: Roman Zippel <zippel@linux-m68k.org>

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Stephen Rothwell
3158e9411a [PATCH] consolidate sys32/compat_adjtimex
Create compat_sys_adjtimex and use it an all appropriate places.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:56:57 -08:00
Con Kolivas
e655a250d5 [PATCH] swswsup: return correct load_image error
If there's an error in load_image() we should return that without checking
snapshot_image_loaded.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:56:55 -08:00
Ingo Molnar
cd7b24bb18 [PATCH] warn if free_irq() is called from IRQ context
Warn if free_irq() is called in IRQ context - free_irq() can execute /proc
VFS work, which must not be done in IRQ context.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:56:53 -08:00
Eric Sesterhenn
910dea7fdd BUG_ON() Conversion in kernel/fork.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-26 18:29:26 +02:00
Linus Torvalds
1b9a391736 Merge branch 'audit.b3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: (22 commits)
  [PATCH] fix audit_init failure path
  [PATCH] EXPORT_SYMBOL patch for audit_log, audit_log_start, audit_log_end and audit_format
  [PATCH] sem2mutex: audit_netlink_sem
  [PATCH] simplify audit_free() locking
  [PATCH] Fix audit operators
  [PATCH] promiscuous mode
  [PATCH] Add tty to syscall audit records
  [PATCH] add/remove rule update
  [PATCH] audit string fields interface + consumer
  [PATCH] SE Linux audit events
  [PATCH] Minor cosmetic cleanups to the code moved into auditfilter.c
  [PATCH] Fix audit record filtering with !CONFIG_AUDITSYSCALL
  [PATCH] Fix IA64 success/failure indication in syscall auditing.
  [PATCH] Miscellaneous bug and warning fixes
  [PATCH] Capture selinux subject/object context information.
  [PATCH] Exclude messages by message type
  [PATCH] Collect more inode information during syscall processing.
  [PATCH] Pass dentry, not just name, in fsnotify creation hooks.
  [PATCH] Define new range of userspace messages.
  [PATCH] Filter rule comparators
  ...

Fixed trivial conflict in security/selinux/hooks.c
2006-03-25 09:24:53 -08:00
Linus Torvalds
1e8c573933 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (21 commits)
  BUG_ON() Conversion in drivers/video/
  BUG_ON() Conversion in drivers/parisc/
  BUG_ON() Conversion in drivers/block/
  BUG_ON() Conversion in sound/sparc/cs4231.c
  BUG_ON() Conversion in drivers/s390/block/dasd.c
  BUG_ON() Conversion in lib/swiotlb.c
  BUG_ON() Conversion in kernel/cpu.c
  BUG_ON() Conversion in ipc/msg.c
  BUG_ON() Conversion in block/elevator.c
  BUG_ON() Conversion in fs/coda/
  BUG_ON() Conversion in fs/binfmt_elf_fdpic.c
  BUG_ON() Conversion in input/serio/hil_mlc.c
  BUG_ON() Conversion in md/dm-hw-handler.c
  BUG_ON() Conversion in md/bitmap.c
  The comment describing how MS_ASYNC works in msync.c is confusing
  rcu: undeclared variable used in documentation
  fix typos "wich" -> "which"
  typo patch for fs/ufs/super.c
  Fix simple typos
  tabify drivers/char/Makefile
  ...
2006-03-25 08:41:09 -08:00
Roman Zippel
5ddcfa878d [PATCH] remove pps support
This removes the support for pps.  It's completely unused within the kernel
and is basically in the way for further cleanups.  It should be easier to
readd proper support for it after the rest has been converted to NTP4
(where the pps mechanisms are quite different from NTP3 anyway).

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: Adrian Bunk <bunk@stusta.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:23:02 -08:00
Dimitri Sivanich
f516342745 [PATCH] Add SA_PERCPU_IRQ flag support
Add support for SA_PERCPU_IRQ (only mmtimer.c uses this at this stage).

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:23:01 -08:00
Eric Dumazet
d74beb9f33 [PATCH] Use unsigned int types for a faster bsearch
This patch avoids arithmetic on 'signed' types that are slower than
'unsigned'.  This saves space and cpu cycles.

size of kernel/sys.o before the patch (gcc-3.4.5)

    text    data     bss     dec     hex filename
   10924     252       4   11180    2bac kernel/sys.o

size of kernel/sys.o after the patch
    text    data     bss     dec     hex filename
   10903     252       4   11159    2b97 kernel/sys.o

I noticed that gcc-4.1.0 (from Fedora Core 5) even uses idiv instruction for
(a+b)/2 if a and b are signed.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:23:01 -08:00
Ashok Raj
34f361ade2 [PATCH] Check if cpu can be onlined before calling smp_prepare_cpu()
- Moved check for online cpu out of smp_prepare_cpu()

- Moved default declaration of smp_prepare_cpu() to kernel/cpu.c

- Removed lock_cpu_hotplug() from smp_prepare_cpu() to around it, since
  its called from cpu_up() as well now.

- Removed clearing from cpu_present_map during cpu_offline as it breaks
  using cpu_up() directly during a subsequent online operation.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: "Li, Shaohua" <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:23:01 -08:00
Eric Dumazet
231bed2058 [PATCH] No need to protect current->group_info in sys_getgroups(), in_group_p() and in_egroup_p()
While doing some benchmarks of an Apache/PHP SMP server, I noticed high
oprofile numbers in in_group_p() and _atomic_dec_and_lock().

rank  percent
  1     4.8911 % __link_path_walk
  2     4.8503 % __d_lookup
*3     4.2911 % _atomic_dec_and_lock
  4     3.9307 % __copy_to_user_ll
  5     4.9004 % sysenter_past_esp
*6     3.3248 % in_group_p

It appears that in_group_p() does an uncessary

get_group_info(current->group_info); /* atomic_inc() */
  ... /* access current->group_info */
put_group_info(current->group_info); /* _atomic_dec_and_lock */

It is not necessary to do this, because the current task holds a reference
on its own group_info, and this reference cannot change during the lookup.

This patch deletes the get_group_info()/put_group_info() pair from
sys_getgroups(), in_group_p() and in_egroup_p() functions.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Tim Hockin <thockin@hockin.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:58 -08:00
Andrew Morton
05eeae208d [PATCH] find_task_by_pid() needs tasklist_lock
A couple of places are forgetting to take it.

The kswapd case is probably unimportant.  keventd_create_kthread() was racy.

The whole thing is a bit flakey: you start a kernel thread, get its pid from
kernel_thread() then look up its task_struct.

a) It assumes that pid recycling takes a "long" time.

b) We get a task_struct but no reference was taken on it.  The owner of the
   kswapd and kthread task_struct*'s must assume that the new thread won't
   exit unexpectedly.  Because if it does, they're left holding dead memory
   and any attempt to control or stop that task will crash.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:57 -08:00
Chris Wright
12b5989be1 [PATCH] refactor capable() to one implementation, add __capable() helper
Move capable() to kernel/capability.c and eliminate duplicate
implementations.  Add __capable() function which can be used to check for
capabiilty of any process.

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:56 -08:00
Bryan Holty
501f2499b8 [PATCH] IRQ: prevent enabling of previously disabled interrupt
This fix prevents re-disabling and enabling of a previously disabled
interrupt.  On an SMP system with irq balancing enabled; If an interrupt is
disabled from within its own interrupt context with disable_irq_nosync and is
also earmarked for processor migration, the interrupt is blindly moved to the
other processor and enabled without regard for its current "enabled" state.
If there is an interrupt pending, it will unexpectedly invoke the irq handler
on the new irq owning processor (even though the irq was previously disabled)

The more intuitive fix would be to invoke disable_irq_nosync and
enable_irq, but since we already have the desc->lock from __do_IRQ, we
cannot call them directly.  Instead we can use the same logic to disable
and enable found in disable_irq_nosync and enable_irq, with regards to the
desc->depth.

This now prevents a disabled interrupt from being re-disabled, and more
importantly prevents a disabled interrupt from being incorrectly enabled on
a different processor.

Signed-off-by: Bryan Holty <lgeek@frontiernet.net>
Cc: Andi Kleen <ak@muc.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:55 -08:00
Andrew Morton
c777ac5594 [PATCH] irq: uninline migration functions
Uninline some massive IRQ migration functions.  Put them in the new
kernel/irq/migration.c.

Cc: Andi Kleen <ak@muc.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:55 -08:00
Adrian Bunk
9871728b75 [PATCH] kernel/params.c: make param_array() static
param_array() in kernel/params.c can now become static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:52 -08:00
Rusty Russell
8d3b33f67f [PATCH] Remove MODULE_PARM
MODULE_PARM was actually breaking: recent gcc version optimize them out as
unused.  It's time to replace the last users, which are generally in the
most unloved drivers anyway.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:52 -08:00
Thomas Gleixner
7d99b7d634 [PATCH] Validate and sanitze itimer timeval from userspace
According to the specification the timevals must be validated and an
errorcode -EINVAL returned in case the timevals are not in canonical form.
This check was never done in Linux.

The pre 2.6.16 code converted invalid timevals silently.  Negative timeouts
were converted by the timeval_to_jiffies conversion to the maximum timeout.

hrtimers and the ktime_t operations expect timevals in canonical form.
Otherwise random results might happen on 32 bits machines due to the
optimized ktime_add/sub operations.  Negative timeouts are treated as
already expired.  This might break applications which work on pre 2.6.16.

To prevent random behaviour and API breakage the timevals are checked and
invalid timevals sanitized in a simliar way as the pre 2.6.16 code did.

Invalid timevals are reported with a per boot limited number of kernel
messages so applications which use this misfeature can be corrected.

After a grace period of one year the sanitizing should be replaced by a
correct validation check.  This is also documented in
Documentation/feature-removal-schedule.txt

The validation and sanitizing is done inside do_setitimer so all callers
(sys_setitimer, compat_sys_setitimer, osf_setitimer) are catched.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:49 -08:00
Thomas Gleixner
c08b8a4910 [PATCH] sys_alarm() unsigned signed conversion fixup
alarm() calls the kernel with an unsigend int timeout in seconds.  The
value is stored in the tv_sec field of a struct timeval to setup the
itimer.  The tv_sec field of struct timeval is of type long, which causes
the tv_sec value to be negative on 32 bit machines if seconds > INT_MAX.

Before the hrtimer merge (pre 2.6.16) such a negative value was converted
to the maximum jiffies timeout by the timeval_to_jiffies conversion.  It's
not clear whether this was intended or just happened to be done by the
timeval_to_jiffies code.

hrtimers expect a timeval in canonical form and treat a negative timeout as
already expired.  This breaks the legitimate usage of alarm() with a
timeout value > INT_MAX seconds.

For 32 bit machines it is therefor necessary to limit the internal seconds
value to avoid API breakage.  Instead of doing this in all implementations
of sys_alarm the duplicated sys_alarm code is moved into a common function
in itimer.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:48 -08:00
Andrew Morton
185ae6d7a3 [PATCH] timer irq driven soft watchdog fix
I seem to have lost this hunk in yesterday's patch.  It brings the
coming-online CPU's softlockup timer up to date so we don't get false-positive
tripups during CPU hot-add.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:48 -08:00
Eric Sesterhenn
6978c7052f BUG_ON() Conversion in kernel/cpu.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-24 18:45:21 +01:00
Andrew Morton
f125b56113 [PATCH] fix build error if CONFIG_SYSFS=n
uevent_seqnum and uevent_helper are only defined if CONFIG_HOTPLUG=y,
CONFIG_NET=n.

(I stole this back from Greg's tree - it makes allnoconfig work).

Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:31 -08:00
Davi Arnaut
24277dda3a [PATCH] strndup_user: convert module
Change hand-coded userspace string copying to strndup_user.

Signed-off-by: Davi Arnaut <davi.arnaut@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:31 -08:00
Ingo Molnar
6687a97d40 [PATCH] timer-irq-driven soft-watchdog, cleanups
Make the softlockup detector purely timer-interrupt driven, removing
softirq-context (timer) dependencies.  This means that if the softlockup
watchdog triggers, it has truly observed a longer than 10 seconds
scheduling delay of a SCHED_FIFO prio 99 task.

(the patch also turns off the softlockup detector during the initial bootup
phase and does small style fixes)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:30 -08:00
Sergey Vlasov
6a4d11c2ab [PATCH] Fix module refcount leak in __set_personality()
If the change of personality does not lead to change of exec domain,
__set_personality() returned without releasing the module reference
acquired by lookup_exec_domain().

Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:30 -08:00
Andrew Morton
d3561f78fd [PATCH] RLIMIT_CPU: document wrong return value
Document the fact that setrlimit(RLIMIT_CPU) doesn't return error codes when
it should.  I don't think we can fix this without a 2.7.x..

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ulrich Weigand <uweigand@de.ibm.com>
Cc: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:30 -08:00
Andrew Morton
e0661111e5 [PATCH] RLIMIT_CPU: fix handling of a zero limit
At present the kernel doesn't honour an attempt to set RLIMIT_CPU to zero
seconds.  But the spec says it should, and that's what 2.4.x does.

Fixing this for real would involve some complexity (such as adding a new
it-has-been-set flag to the task_struct, and testing that everwhere, instead
of overloading the value of it_prof_expires).

Given that a 2.4 kernel won't actually send the signal until one second has
expired anyway, let's just handle this case by treating the caller's
zero-seconds as one second.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ulrich Weigand <uweigand@de.ibm.com>
Cc: Cliff Wickman <cpw@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:30 -08:00
Andrew Morton
ec9e16bacd [PATCH] sys_setrlimit() cleanup
- Whitespace cleanups

- Make that expression comprehensible.

There's a potential logic change here: we do the "is it_prof_expires equal to
zero" test after converting it to seconds, rather than doing the comparison
between raw cputime_t's.

But given that it's in units of seconds anyway, that shouldn't change
anything.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ulrich Weigand <uweigand@de.ibm.com>
Cc: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:30 -08:00
John Z. Bohach
2ea1c5392c [PATCH] console_setup() depends (wrongly?) on CONFIG_PRINTK
It appears that console_setup() code only gets compiled into the kernel if
CONFIG_PRINTK is enabled.  One detrimental side-effect of this is that
serial8250_console_setup() never gets invoked when CONFIG_PRINTK is not
set, resulting in baud rate not being read/parsed from command line (i.e.
console=ttyS0,115200n8 is ignored, at least the baud rate part...)

Attached patch moves console_setup() code from inside

#ifdef CONFIG_PRINTK

to outside (in printk.c), removing dependence on said config. option.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:27 -08:00
Paul Jackson
29afd49b72 [PATCH] cpuset: remove useless local variable initialization
Remove a useless variable initialization in cpuset __cpuset_zone_allowed().
 The local variable 'allowed' is unconditionally set before use, later on
in the code, so does not need to be initialized.

Not that it seems to matter to the code generated any, as the compiler
optimizes out the superfluous assignment anyway.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:24 -08:00
Paul Jackson
151a44202d [PATCH] cpuset: don't need to mark cpuset_mems_generation atomic
Drop the atomic_t marking on the cpuset static global
cpuset_mems_generation.  Since all access to it is guarded by the global
manage_mutex, there is no need for further serialization of this value.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:24 -08:00
Paul Jackson
8488bc359d [PATCH] cpuset: remove unnecessary NULL check
Remove a no longer needed test for NULL cpuset pointer, with a little
comment explaining why the test isn't needed.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:23 -08:00
Paul Jackson
c61afb181c [PATCH] cpuset memory spread slab cache optimizations
The hooks in the slab cache allocator code path for support of NUMA
mempolicies and cpuset memory spreading are in an important code path.  Many
systems will use neither feature.

This patch optimizes those hooks down to a single check of some bits in the
current tasks task_struct flags.  For non NUMA systems, this hook and related
code is already ifdef'd out.

The optimization is done by using another task flag, set if the task is using
a non-default NUMA mempolicy.  Taking this flag bit along with the
PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
memory spreading' patch set, one can check for the combination of any of these
special case memory placement mechanisms with a single test of the current
tasks task_struct flags.

This patch also tightens up the code, to save a few bytes of kernel text
space, and moves some of it out of line.  Due to the nested inlines called
from multiple places, we were ending up with three copies of this code, which
once we get off the main code path (for local node allocation) seems a bit
wasteful of instruction memory.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:23 -08:00
Paul Jackson
825a46af5a [PATCH] cpuset memory spread basic implementation
This patch provides the implementation and cpuset interface for an alternative
memory allocation policy that can be applied to certain kinds of memory
allocations, such as the page cache (file system buffers) and some slab caches
(such as inode caches).

The policy is called "memory spreading." If enabled, it spreads out these
kinds of memory allocations over all the nodes allowed to a task, instead of
preferring to place them on the node where the task is executing.

All other kinds of allocations, including anonymous pages for a tasks stack
and data regions, are not affected by this policy choice, and continue to be
allocated preferring the node local to execution, as modified by the NUMA
mempolicy.

There are two boolean flag files per cpuset that control where the kernel
allocates pages for the file system buffers and related in kernel data
structures.  They are called 'memory_spread_page' and 'memory_spread_slab'.

If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
kernel will spread the file system buffers (page cache) evenly over all the
nodes that the faulting task is allowed to use, instead of preferring to put
those pages on the node where the task is running.

If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
kernel will spread some file system related slab caches, such as for inodes
and dentries evenly over all the nodes that the faulting task is allowed to
use, instead of preferring to put those pages on the node where the task is
running.

The implementation is simple.  Setting the cpuset flags 'memory_spread_page'
or 'memory_spread_cache' turns on the per-process flags PF_SPREAD_PAGE or
PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
subsequently joins that cpuset.  In subsequent patches, the page allocation
calls for the affected page cache and slab caches are modified to perform an
inline check for these flags, and if set, a call to a new routine
cpuset_mem_spread_node() returns the node to prefer for the allocation.

The cpuset_mem_spread_node() routine is also simple.  It uses the value of a
per-task rotor cpuset_mem_spread_rotor to select the next node in the current
tasks mems_allowed to prefer for the allocation.

This policy can provide substantial improvements for jobs that need to place
thread local data on the corresponding node, but that need to access large
file system data sets that need to be spread across the several nodes in the
jobs cpuset in order to fit.  Without this patch, especially for jobs that
might have one thread reading in the data set, the memory allocation across
the nodes in the jobs cpuset can become very uneven.

A couple of Copyright year ranges are updated as well.  And a couple of email
addresses that can be found in the MAINTAINERS file are removed.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:22 -08:00
Paul Jackson
8a39cc60bf [PATCH] cpuset use combined atomic_inc_return calls
Replace pairs of calls to <atomic_inc, atomic_read>, with a single call
atomic_inc_return, saving a few bytes of source and kernel text.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:22 -08:00
Paul Jackson
7b5b9ef0e1 [PATCH] cpuset cleanup not not operators
Since the test_bit() bit operator is boolean (return 0 or 1), the double not
"!!" operations needed to convert a scalar (zero or not zero) to a boolean are
not needed.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:22 -08:00
Paul E. McKenney
95c3832272 [PATCH] rcutorture: tag success/failure line with module parameters
A long-running rcutorture test can overflow dmesg, so that the line
containing the module parameters is lost.  Although it is usually possible
to retrieve this information from the log files, it is much better to just
tag it onto the final success/failure line so that it may be easily found.
This patch does just that.

Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:22 -08:00
Jan Beulich
a4a6198b80 [PATCH] tvec_bases too large for per-cpu data
With internal Xen-enabled kernels we see the kernel's static per-cpu data
area exceed the limit of 32k on x86-64, and even native x86-64 kernels get
fairly close to that limit.  I generally question whether it is reasonable
to have data structures several kb in size allocated as per-cpu data when
the space there is rather limited.

The biggest arch-independent consumer is tvec_bases (over 4k on 32-bit
archs, over 8k on 64-bit ones), which now gets converted to use dynamically
allocated memory instead.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:21 -08:00
Oleg Nesterov
caa9ee771d [PATCH] rcu_process_callbacks: don't cli() while testing ->nxtlist
__rcu_process_callbacks() disables interrupts to protect itself from
call_rcu() which adds new entries to ->nxtlist.

However we can check "->nxtlist != NULL" with interrupts enabled, we can't
get "false positives" because call_rcu() can only change this condition
from 0 to 1.

Tested with rcutorture.ko.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:20 -08:00
Bart Samwel
cba9f33d13 [PATCH] Range checking in do_proc_dointvec_(userhz_)jiffies_conv
When (integer) sysctl values are in either seconds or centiseconds, but
represented internally as jiffies, the allowable value range is decreased.
This patch adds range checks to the conversion routines.

For values in seconds: maximum LONG_MAX / HZ.

For values in centiseconds: maximum (LONG_MAX / HZ) * USER_HZ.

(BTW, does anyone else feel that an interface in seconds should not be
accepting negative values?)

Signed-off-by: Bart Samwel <bart@samwel.tk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:20 -08:00
Bart Samwel
ed5b43f15a [PATCH] Represent laptop_mode as jiffies internally
Make that the internal value for /proc/sys/vm/laptop_mode is stored as
jiffies instead of seconds.  Let the sysctl interface do the conversions,
instead of doing on-the-fly conversions every time the value is used.

Add a description of the fact that laptop_mode doubles as a flag and a
timeout to the comment above the laptop_mode variable.

Signed-off-by: Bart Samwel <bart@samwel.tk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:20 -08:00
Bart Samwel
f6ef943813 [PATCH] Represent dirty_*_centisecs as jiffies internally
Make that the internal values for:

/proc/sys/vm/dirty_writeback_centisecs
/proc/sys/vm/dirty_expire_centisecs

are stored as jiffies instead of centiseconds.  Let the sysctl interface do
the conversions with full precision using clock_t_to_jiffies, instead of
doing overflow-sensitive on-the-fly conversions every time the values are
used.

Cons: apparent precision loss if HZ is not a multiple of 100, because of
conversion back and forth.  This is a common problem for all sysctl values
that use proc_dointvec_userhz_jiffies.  (There is only one other in-tree
use, in net/core/neighbour.c.)

Signed-off-by: Bart Samwel <bart@samwel.tk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:20 -08:00
Andrew Morton
36f574135e [PATCH] free_uid() locking improvement
Reduce lock hold times in free_uid().

Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:20 -08:00
Jens Axboe
2056a782f8 [PATCH] Block queue IO tracing support (blktrace) as of 2006-03-23
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-23 20:00:26 +01:00
Tom Zanussi
6dac40a7ce [PATCH] relay: consolidate sendfile() and read() code
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-23 19:58:45 +01:00
Jens Axboe
221415d762 [PATCH] relay: add sendfile() support
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-23 19:57:55 +01:00
Jens Axboe
b86ff981a8 [PATCH] relay: migrate from relayfs to a generic relay API
Original patch from Paul Mundt, sysfs parts removed by me since they
were broken.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-23 19:56:55 +01:00
Adrian Bunk
2178426d26 [PATCH] kernel/rcupdate.c: make two structs static
This patch makes two needlessly global structs static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:17 -08:00
Oleg Nesterov
ee25e96fcd [PATCH] BUILD_LOCK_OPS: cleanup preempt_disable() usage
This patch changes the code from:

	preempt_disable();
	for (;;) {
		...
		preempt_disable();
	}
to:
	for (;;) {
		preempt_disable();
		...
	}

which seems more clean to me and saves a couple of bytes for
each function.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:16 -08:00
Andrew Morton
dd287796d6 [PATCH] pause_on_oops command line option
Attempt to fix the problem wherein people's oops reports scroll off the screen
due to repeated oopsing or to oopses on other CPUs.

If this happens the user can reboot with the `pause_on_oops=<seconds>' option.
It will allow the first oopsing CPU to print an oops record just a single
time.  Second oopsing attempts, or oopses on other CPUs will cause those CPUs
to enter a tight loop until the specified number of seconds have elapsed.

The patch implements the infrastructure generically in the expectation that
architectures other than x86 will find it useful.

Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:16 -08:00
Ingo Molnar
91368d73e4 [PATCH] make bug messages more consistent
Consolidate all kernel bug printouts to begin with the "BUG: " string.
Makes it easier to find them in large bootup logs.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:16 -08:00
Oleg Nesterov
a26fd335b4 [PATCH] sigprocmask: kill unneeded temp var
Cleanup, remove unneeded double copying of current->blocked.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:15 -08:00
Ashutosh Naik
6389a38511 [PATCH] kernel/module.c Semaphore to Mutex Conversion for module_mutex
This patch converts the module_mutex semaphore to a mutex.

Signed-off-by: Ashutosh Naik <ashutosh.naik@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:14 -08:00
Ingo Molnar
7a7d1cf954 [PATCH] sem2mutex: kprobes
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:12 -08:00
Ingo Molnar
70522e121a [PATCH] sem2mutex: tty
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:11 -08:00
Arjan van de Ven
97d1f15b7e [PATCH] sem2mutex: kernel/
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:10 -08:00
Ingo Molnar
9331b3157c [PATCH] convert kernel/rcupdate.c:rcu_barrier_sema to mutex
Convert kernel/rcupdate's rcu_barrier_sema to mutex.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:10 -08:00
Ingo Molnar
3d3f26a7ba [PATCH] kernel/cpuset.c, mutex conversion
convert cpuset.c's callback_sem and manage_sem to mutexes.
Build and boot tested by Ingo.
Build, boot, unit and stress tested by pj.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:10 -08:00
Ravikiran G Thirumalai
2dd0ebcd2a [PATCH] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
Avoid taking the global tasklist_lock when possible, if a process is single
threaded during getrusage().  Any avoidance of tasklist_lock is good for
NUMA boxes (and possibly for large SMPs).  Thanks to Oleg Nesterov for
review and suggestions.

Signed-off-by: Nippun Goel <nippung@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:09 -08:00
Eric Dumazet
0c9e63fd38 [PATCH] Shrinks sizeof(files_struct) and better layout
1) Reduce the size of (struct fdtable) to exactly 64 bytes on 32bits
   platforms, lowering kmalloc() allocated space by 50%.

2) Reduce the size of (files_struct), using a special 32 bits (or
   64bits) embedded_fd_set, instead of a 1024 bits fd_set for the
   close_on_exec_init and open_fds_init fields.  This save some ram (248
   bytes per task) as most tasks dont open more than 32 files.  D-Cache
   footprint for such tasks is also reduced to the minimum.

3) Reduce size of allocated fdset.  Currently two full pages are
   allocated, that is 32768 bits on x86 for example, and way too much.  The
   minimum is now L1_CACHE_BYTES.

UP and SMP should benefit from this patch, because most tasks will touch
only one cache line when open()/close() stdin/stdout/stderr (0/1/2),
(next_fd, close_on_exec_init, open_fds_init, fd_array[0 ..  2] being in the
same cache line)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:09 -08:00
Luca Tettamanti
9b238205ba [PATCH] swsusp: add s2ram ioctl to userland interface
Add the SNAPSHOT_S2RAM ioctl to the snapshot device.

This ioctl allows a userland application to make the system (previously frozen
with the SNAPSHOT_FREE ioctl) enter the S3 state without freezing processes
and disabling nonboot CPUs for the second time.

This will allow us to implement the suspend-to-disk-and-RAM (STDR)
functionality in the userland suspend tools.

Signed-off-by: Luca Tettamanti <kronos.it@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:08 -08:00
Rafael J. Wysocki
94c188d329 [PATCH] swsusp: let userland tools switch console on suspend
Remove the console-switching code from the suspend part of the swsusp userland
interface and let the userland tools switch the console.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:08 -08:00
Shaohua Li
e4e4d66556 [PATCH] swsusp: drain high mem pages
Highmem could be in pcp list as well.

Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:08 -08:00
Rafael J. Wysocki
fc558a7496 [PATCH] swsusp: finally solve mysqld problem
This patch from Pavel moves userland freeze signals handling into more logical
place.  It now hits even with mysqld running.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:08 -08:00
Pavel Machek
ce6ed29f31 [PATCH] suspend: make progress printing prettier
Combination of printk/pr_debug led to <7> in the middle of the line, and we
printed way too many dots.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:08 -08:00
Rafael J. Wysocki
02aaeb9b95 [PATCH] swsusp: freeze user space processes first
Allow swsusp to freeze processes successfully under heavy load by freezing
userspace processes before kernel threads.

[Thanks to Nigel Cunningham <nigel@suspend2.net> for suggesting the
way to go.]

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:08 -08:00
Rafael J. Wysocki
6e1819d615 [PATCH] swsusp: userland interface
This patch introduces a user space interface for swsusp.

The interface is based on a special character device, called the snapshot
device, that allows user space processes to perform suspend and resume-related
operations with the help of some ioctls and the read()/write() functions.
 Additionally it allows these processes to allocate free swap pages from a
selected swap partition, called the resume partition, so that they know which
sectors of the resume partition are available to them.

The interface uses the same low-level system memory snapshot-handling
functions that are used by the built-it swap-writing/reading code of swsusp.

The interface documentation is included in the patch.

The patch assumes that the major and minor numbers of the snapshot device will
be 10 (ie.  misc device) and 231, the registration of which has already been
requested.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:07 -08:00
Pavel Machek
543cc27d09 [PATCH] swsusp: documentation updates
Update suspend-to-RAM documentation with new machines, and makes message
when processes can't be stopped little clearer.  (In one case, waiting
longer actually did help).

From: "Rafael J. Wysocki" <rjw@sisk.pl>

  Warn in the documentation that data may be lost if there are some
  filesystems mounted from USB devices before suspend.

  [Thanks to Alan Stern for providing the answer to the question in the
  Q:-A: part.]

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:07 -08:00
Randy Dunlap
74c7e2efbe [PATCH] kernel/power: move externs to header files
Move externs from C source files to header files.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:07 -08:00
Rafael J. Wysocki
61159a314b [PATCH] swsusp: separate swap-writing/reading code
Move the swap-writing/reading code of swsusp to a separate file.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:07 -08:00
Rafael J. Wysocki
f577eb30af [PATCH] swsusp: low level interface
Introduce the low level interface that can be used for handling the
snapshot of the system memory by the in-kernel swap-writing/reading code of
swsusp and the userland interface code (to be introduced shortly).

Also change the way in which swsusp records the allocated swap pages and,
consequently, simplifies the in-kernel swap-writing/reading code (this is
necessary for the userland interface too).  To this end, it introduces two
helper functions in mm/swapfile.c, so that the swsusp code does not refer
directly to the swap internals.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:07 -08:00
Andrew Morton
2b322ce210 [PATCH] revert "swsusp: fix breakage with swap on lvm"
This was a temporary thing for 2.6.16.

Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:07 -08:00
Anton Blanchard
e9028b0ff2 [PATCH] fix scheduler deadlock
We have noticed lockups during boot when stress testing kexec on ppc64.
Two cpus would deadlock in scheduler code trying to grab already taken
spinlocks.

The double_rq_lock code uses the address of the runqueue to order the
taking of multiple locks.  This address is a per cpu variable:

	if (rq1 < rq2) {
		spin_lock(&rq1->lock);
		spin_lock(&rq2->lock);
	} else {
		spin_lock(&rq2->lock);
		spin_lock(&rq1->lock);
	}

On the other hand, the code in wake_sleeping_dependent uses the cpu id
order to grab locks:

	for_each_cpu_mask(i, sibling_map)
		spin_lock(&cpu_rq(i)->lock);

This means we rely on the address of per cpu data increasing as cpu ids
increase.  While this will be true for the generic percpu implementation it
may not be true for arch specific implementations.

One way to solve this is to always take runqueues in cpu id order. To do
this we add a cpu variable to the runqueue and check it in the
double runqueue locking functions.

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:02 -08:00
Linus Torvalds
2e6e33bab6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (78 commits)
  [PATCH] powerpc: Add FSL SEC node to documentation
  [PATCH] macintosh: tidy-up driver_register() return values
  [PATCH] powerpc: tidy-up of_register_driver()/driver_register() return values
  [PATCH] powerpc: via-pmu warning fix
  [PATCH] macintosh: cleanup the use of i2c headers
  [PATCH] powerpc: dont allow old RTC to be selected
  [PATCH] powerpc: make powerbook_sleep_grackle static
  [PATCH] powerpc: Fix warning in add_memory
  [PATCH] powerpc: update mailing list addresses
  [PATCH] powerpc: Remove calculation of io hole
  [PATCH] powerpc: iseries: Add bootargs to /chosen
  [PATCH] powerpc: iseries: Add /system-id, /model and /compatible
  [PATCH] powerpc: Add strne2a() to convert a string from EBCDIC to ASCII
  [PATCH] powerpc: iseries: Make more stuff static in platforms/iseries/mf.c
  [PATCH] powerpc: iseries: Remove pointless iSeries_(restart|power_off|halt)
  [PATCH] powerpc: iseries: mf related cleanups
  [PATCH] powerpc: Replace platform_is_lpar() with a firmware feature
  [PATCH] powerpc: trivial: Cleanup whitespace in cputable.h
  [PATCH] powerpc: Remove unused iommu_off logic from pSeries_init_early()
  [PATCH] powerpc: Unconfuse htab_bolt_mapping() callers
  ...
2006-03-22 22:20:46 -08:00
Linus Torvalds
2152f85366 Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (138 commits)
  [SCSI] libata: implement minimal transport template for ->eh_timed_out
  [SCSI] eliminate rphy allocation in favour of expander/end device allocation
  [SCSI] convert mptsas over to end_device/expander allocations
  [SCSI] allow displaying and setting of cache type via sysfs
  [SCSI] add scsi_mode_select to scsi_lib.c
  [SCSI] 3ware 9000 add big endian support
  [SCSI] qla2xxx: update MAINTAINERS
  [SCSI] scsi: move target_destroy call
  [SCSI] fusion - bump version
  [SCSI] fusion - expander hotplug suport in mptsas module
  [SCSI] fusion - exposing raid components in mptsas
  [SCSI] fusion - memory leak, and initializing fields
  [SCSI] fusion - exclosure misspelled
  [SCSI] fusion - cleanup mptsas event handling functions
  [SCSI] fusion - removing target_id/bus_id from the VirtDevice structure
  [SCSI] fusion - static fix's
  [SCSI] fusion - move some debug firmware event debug msgs to verbose level
  [SCSI] fusion - loginfo header update
  [SCSI] add scsi_reprobe_device
  [SCSI] megaraid_sas: fix extended timeout handling
  ...
2006-03-22 10:47:24 -08:00
Andrew Morton
78eef01b0f [PATCH] on_each_cpu(): disable local interrupts
When on_each_cpu() runs the callback on other CPUs, it runs with local
interrupts disabled.  So we should run the function with local interrupts
disabled on this CPU, too.

And do the same for UP, so the callback is run in the same environment on both
UP and SMP.  (strictly it should do preempt_disable() too, but I think
local_irq_disable is sufficiently equivalent).

Also uninlines on_each_cpu().  softirq.c was the most appropriate file I could
find, but it doesn't seem to justify creating a new file.

Oh, and fix up that comment over (under?) x86's smp_call_function().  It
drives me nuts.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:53:59 -08:00
Eric W. Biederman
06f9d4f94a [PATCH] unshare: Error if passed unsupported flags
A bare bones trivial patch to ensure we always get -EINVAL on the
unsupported cases for sys_unshare.  If this goes in before 2.6.16 it allows
us to forward compatible with future applications using sys_unshare.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: JANAK DESAI <janak@us.ibm.com>
Cc: <stable@kerenl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:53:55 -08:00
Mike Galbraith
9430d58e34 [PATCH] sched: remove sleep_avg multiplier
Remove the sleep_avg multiplier.  This multiplier was necessary back when
we had 10 seconds of dynamic range in sleep_avg, but now that we only have
one second, it causes that one second to be compressed down to 100ms in
some cases.  This is particularly noticeable when compiling a kernel in a
slow NFS mount, and I believe it to be a very likely candidate for other
recently reported network related interactivity problems.

In testing, I can detect no negative impact of this removal.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:53:54 -08:00
James Bottomley
d04cdb6421 Merge ../linux-2.6 2006-03-21 13:05:45 -06:00
Greg Kroah-Hartman
03e88ae1b1 [PATCH] fix module sysfs files reference counting
The module files, refcnt, version, and srcversion did not properly
increment the owner's module reference count, allowing the modules to
be removed while the files were open, causing oopses.

This patch fixes this, and also fixes the problem that the version and
srcversion files were not showing up, unless CONFIG_MODULE_UNLOAD was
enabled, which is not correct.

Cc: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20 13:42:58 -08:00
Greg Kroah-Hartman
01ca70dca5 [PATCH] add EXPORT_SYMBOL_GPL_FUTURE() to RCU subsystem
As the RCU symbols are going to be changed to GPL in the near future,
lets warn users that this is going to happen.

Cc: Paul McKenney <paulmck@us.ibm.com>
Acked-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20 13:42:58 -08:00
Greg Kroah-Hartman
9f28bb7e1d [PATCH] add EXPORT_SYMBOL_GPL_FUTURE()
This patch adds the ability to mark symbols that will be changed in the
future, so that kernel modules that don't include MODULE_LICENSE("GPL")
and use the symbols, will be flagged and printed out to the system log.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20 13:42:58 -08:00
Sam Ravnborg
3fd6805f4d [PATCH] Clean up module.c symbol searching logic
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20 13:42:58 -08:00
Jun'ichi Nomura
51107301b6 [PATCH] kobject: fix build error if CONFIG_SYSFS=n
Moving uevent_seqnum and uevent_helper to kobject_uevent.c
because they are used even if CONFIG_SYSFS=n
while kernel/ksysfs.c is built only if CONFIG_SYSFS=y,

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20 13:42:57 -08:00
Amy Griffis
71e1c784b2 [PATCH] fix audit_init failure path
Make audit_init() failure path handle situations where the audit_panic()
action is not AUDIT_FAIL_PANIC (default is AUDIT_FAIL_PRINTK).  Other uses
of audit_sock are not reached unless audit's netlink message handler is
properly registered.  Bug noticed by Peter Staubach.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:55 -05:00
lorenzo@gnu.org
bf45da97a4 [PATCH] EXPORT_SYMBOL patch for audit_log, audit_log_start, audit_log_end and audit_format
Hi,

This is a trivial patch that enables the possibility of using some auditing
functions within loadable kernel modules (ie. inside a Linux Security Module).

_

Make the audit_log_start, audit_log_end, audit_format and audit_log
interfaces available to Loadable Kernel Modules, thus making possible
the usage of the audit framework inside LSMs, etc.

Signed-off-by: <Lorenzo Hernández García-Hierro <lorenzo@gnu.org>>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:55 -05:00
Ingo Molnar
5a0bbce58b [PATCH] sem2mutex: audit_netlink_sem
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:55 -05:00
Ingo Molnar
4023e02080 [PATCH] simplify audit_free() locking
Simplify audit_free()'s locking: no need to lock a task that we are tearing
down.  [the extra locking also caused false positives in the lock
validator]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:55 -05:00
Dustin Kirkland
d9d9ec6e2c [PATCH] Fix audit operators
Darrel Goeddel initiated a discussion on IRC regarding the possibility
of audit_comparator() returning -EINVAL signaling an invalid operator.

It is possible when creating the rule to assure that the operator is one
of the 6 sane values.  Here's a snip from include/linux/audit.h  Note
that 0 (nonsense) and 7 (all operators) are not valid values for an
operator.

...

/* These are the supported operators.
 *      4  2  1
 *      =  >  <
 *      -------
 *      0  0  0         0       nonsense
 *      0  0  1         1       <
 *      0  1  0         2       >
 *      0  1  1         3       !=
 *      1  0  0         4       =
 *      1  0  1         5       <=
 *      1  1  0         6       >=
 *      1  1  1         7       all operators
 */
...

Furthermore, prior to adding these extended operators, flagging the
AUDIT_NEGATE bit implied !=, and otherwise == was assumed.

The following code forces the operator to be != if the AUDIT_NEGATE bit
was flipped on.  And if no operator was specified, == is assumed.  The
only invalid condition is if the AUDIT_NEGATE bit is off and all of the
AUDIT_EQUAL, AUDIT_LESS_THAN, and AUDIT_GREATER_THAN bits are
on--clearly a nonsensical operator.

Now that this is handled at rule insertion time, the default -EINVAL
return of audit_comparator() is eliminated such that the function can
only return 1 or 0.

If this is acceptable, let's get this applied to the current tree.

:-Dustin

--

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from 9bf0a8e137040f87d1b563336d4194e38fb2ba1a commit)
2006-03-20 14:08:55 -05:00
Steve Grubb
a6c043a887 [PATCH] Add tty to syscall audit records
Hi,

>From the RBAC specs:

FAU_SAR.1.1 The TSF shall provide the set of authorized
RBAC administrators with the capability to read the following
audit information from the audit records:

<snip>
(e) The User Session Identifier or Terminal Type

A patch adding the tty for all syscalls is included in this email.
Please apply.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:55 -05:00
Steve Grubb
5d3301088f [PATCH] add/remove rule update
Hi,

The following patch adds a little more information to the add/remove rule message emitted
by the kernel.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:55 -05:00
Amy Griffis
93315ed6dd [PATCH] audit string fields interface + consumer
Updated patch to dynamically allocate audit rule fields in kernel's
internal representation.  Added unlikely() calls for testing memory
allocation result.

Amy Griffis wrote:     [Wed Jan 11 2006, 02:02:31PM EST]
> Modify audit's kernel-userspace interface to allow the specification
> of string fields in audit rules.
>
> Signed-off-by: Amy Griffis <amy.griffis@hp.com>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from 5ffc4a863f92351b720fe3e9c5cd647accff9e03 commit)
2006-03-20 14:08:54 -05:00
David Woodhouse
d884596f44 [PATCH] Minor cosmetic cleanups to the code moved into auditfilter.c
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20 14:08:54 -05:00
David Woodhouse
fe7752bab2 [PATCH] Fix audit record filtering with !CONFIG_AUDITSYSCALL
This fixes the per-user and per-message-type filtering when syscall
auditing isn't enabled.

[AV: folded followup fix from the same author]

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:54 -05:00
Dustin Kirkland
7306a0b9b3 [PATCH] Miscellaneous bug and warning fixes
This patch fixes a couple of bugs revealed in new features recently
added to -mm1:
* fixes warnings due to inconsistent use of const struct inode *inode
* fixes bug that prevent a kernel from booting with audit on, and SELinux off
  due to a missing function in security/dummy.c
* fixes a bug that throws spurious audit_panic() messages due to a missing
  return just before an error_path label
* some reasonable house cleaning in audit_ipc_context(),
  audit_inode_context(), and audit_log_task_context()

Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20 14:08:54 -05:00
Dustin Kirkland
8c8570fb8f [PATCH] Capture selinux subject/object context information.
This patch extends existing audit records with subject/object context
information. Audit records associated with filesystem inodes, ipc, and
tasks now contain SELinux label information in the field "subj" if the
item is performing the action, or in "obj" if the item is the receiver
of an action.

These labels are collected via hooks in SELinux and appended to the
appropriate record in the audit code.

This additional information is required for Common Criteria Labeled
Security Protection Profile (LSPP).

[AV: fixed kmalloc flags use]
[folded leak fixes]
[folded cleanup from akpm (kfree(NULL)]
[folded audit_inode_context() leak fix]
[folded akpm's fix for audit_ipc_perm() definition in case of !CONFIG_AUDIT]

Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:54 -05:00
Dustin Kirkland
c8edc80c8b [PATCH] Exclude messages by message type
- Add a new, 5th filter called "exclude".
    - And add a new field AUDIT_MSGTYPE.
    - Define a new function audit_filter_exclude() that takes a message type
      as input and examines all rules in the filter.  It returns '1' if the
      message is to be excluded, and '0' otherwise.
    - Call the audit_filter_exclude() function near the top of
      audit_log_start() just after asserting audit_initialized.  If the
      message type is not to be audited, return NULL very early, before
      doing a lot of work.
[combined with followup fix for bug in original patch, Nov 4, same author]
[combined with later renaming AUDIT_FILTER_EXCLUDE->AUDIT_FILTER_TYPE
and audit_filter_exclude() -> audit_filter_type()]

Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:54 -05:00
Amy Griffis
73241ccca0 [PATCH] Collect more inode information during syscall processing.
This patch augments the collection of inode info during syscall
processing. It represents part of the functionality that was provided
by the auditfs patch included in RHEL4.

Specifically, it:

- Collects information for target inodes created or removed during
  syscalls.  Previous code only collects information for the target
  inode's parent.

- Adds the audit_inode() hook to syscalls that operate on a file
  descriptor (e.g. fchown), enabling audit to do inode filtering for
  these calls.

- Modifies filtering code to check audit context for either an inode #
  or a parent inode # matching a given rule.

- Modifies logging to provide inode # for both parent and child.

- Protect debug info from NULL audit_names.name.

[AV: folded a later typo fix from the same author]

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:53 -05:00
Amy Griffis
f38aa94224 [PATCH] Pass dentry, not just name, in fsnotify creation hooks.
The audit hooks (to be added shortly) will want to see dentry->d_inode
too, not just the name.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20 14:08:53 -05:00
Steve Grubb
90d526c074 [PATCH] Define new range of userspace messages.
The attached patch updates various items for the new user space
messages. Please apply.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20 14:08:53 -05:00
Dustin Kirkland
b63862f465 [PATCH] Filter rule comparators
Currently, audit only supports the "=" and "!=" operators in the -F
filter rules.

This patch reworks the support for "=" and "!=", and adds support
for ">", ">=", "<", and "<=".

This turned out to be a pretty clean, and simply process.  I ended up
using the high order bits of the "field", as suggested by Steve and Amy.
This allowed for no changes whatsoever to the netlink communications.
See the documentation within the patch in the include/linux/audit.h
area, where there is a table that explains the reasoning of the bitmask
assignments clearly.

The patch adds a new function, audit_comparator(left, op, right).
This function will perform the specified comparison (op, which defaults
to "==" for backward compatibility) between two values (left and right).
If the negate bit is on, it will negate whatever that result was.  This
value is returned.

Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20 14:08:53 -05:00
Randy Dunlap
b0dd25a826 [PATCH] AUDIT: kerneldoc for kernel/audit*.c
- add kerneldoc for non-static functions;
- don't init static data to 0;
- limit lines to < 80 columns;
- fix long-format style;
- delete whitespace at end of some lines;

(chrisw: resend and update to current audit-2.6 tree)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20 14:08:53 -05:00
Jason Baron
7e7f8a036b [PATCH] make vm86 call audit_syscall_exit
hi,

The motivation behind the patch below was to address messages in
/var/log/messages such as:

Jan 31 10:54:15 mets kernel: audit(:0): major=252 name_count=0: freeing
multiple contexts (1)
Jan 31 10:54:15 mets kernel: audit(:0): major=113 name_count=0: freeing
multiple contexts (2)

I can reproduce by running 'get-edid' from:
http://john.fremlin.de/programs/linux/read-edid/.

These messages come about in the log b/c the vm86 calls do not exit via
the normal system call exit paths and thus do not call
'audit_syscall_exit'. The next system call will then free the context for
itself and for the vm86 context, thus generating the above messages. This
patch addresses the issue by simply adding a call to 'audit_syscall_exit'
from the vm86 code.

Besides fixing the above error messages the patch also now allows vm86
system calls to become auditable. This is useful since strace does not
appear to properly record the return values from sys_vm86.

I think this patch is also a step in the right direction in terms of
cleaning up some core auditing code. If we can correct any other paths
that do not properly call the audit exit and entries points, then we can
also eliminate the notion of context chaining.

I've tested this patch by verifying that the log messages no longer
appear, and that the audit records for sys_vm86 appear to be correct.
Also, 'read_edid' produces itentical output.

thanks,

-Jason

Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:53 -05:00
Al Viro
afc847b7dd [PATCH] don't do exit_io_context() until we know we won't be doing any IO
testcase:

mount /dev/sdb10 /mnt
touch /mnt/tmp/b
umount /mnt
mount /dev/sdb10 /mnt
rm /mnt/tmp/b </mnt/tmp/b
umount /mnt

and watch blkdev_ioc line in /proc/slabinfo.  Vanilla kernel leaks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-18 18:33:46 -05:00
Oleg Nesterov
2d61b86775 [PATCH] disable unshare(CLONE_VM) for now
sys_unshare() does mmput(new_mm).  This is not enough if we have
mm->core_waiters.

This patch is a temporary fix for soon to be released 2.6.16.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
[ Checked with Uli: "I'm not planning to use unshare(CLONE_VM).  It's
  not needed for any functionality planned so far.  What we (as in Red
  Hat) need unshare() for now is the filesystem side." ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-18 10:49:36 -08:00
Roman Zippel
a0a0c28c1a [PATCH] posix-timers: fix requeue accounting when signal is ignored
When the posix-timer signal is ignored then the timer is rearmed by the
callback function.  The requeue pending accounting has to be fixed up else
the state might be wrong.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17 07:51:25 -08:00
Christoph Lameter
67890d7084 [PATCH] time_interpolator: add __read_mostly
The pointer to the current time interpolator and the current list of time
interpolators are typically only changed during bootup.  Adding
__read_mostly takes them away from possibly hot cachelines.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17 07:51:25 -08:00
Eric W. Biederman
e0e8eb54d8 [PATCH] unshare: Use rcu_assign_pointer when setting sighand
The sighand pointer only needs the rcu_read_lock on the
read side.  So only depending on task_lock protection
when setting this pointer is not enough.  We also need
a memory barrier to ensure the initialization is seen first.

Use rcu_assign_pointer as it does this for us, and clearly
documents that we are setting an rcu readable pointer.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17 07:46:59 -08:00
Paul Mackerras
23dd640112 Merge ../linux-2.6 2006-03-17 12:01:19 +11:00
James Bottomley
f33b5d783b Merge ../linux-2.6 2006-03-14 14:18:01 -06:00
GOTO Masanori
f9a3879abf [PATCH] Fix sigaltstack corruption among cloned threads
This patch fixes alternate signal stack corruption among cloned threads
with CLONE_SIGHAND (and CLONE_VM) for linux-2.6.16-rc6.

The value of alternate signal stack is currently inherited after a call of
clone(...  CLONE_SIGHAND | CLONE_VM).  But if sigaltstack is set by a
parent thread, and then if multiple cloned child threads (+ parent threads)
call signal handler at the same time, some threads may be conflicted -
because they share to use the same alternative signal stack region.
Finally they get sigsegv.  It's an undesirable race condition.  Note that
child threads created from NPTL pthread_create() also hit this conflict
when the parent thread uses sigaltstack, without my patch.

To fix this problem, this patch clears the child threads' sigaltstack
information like exec().  This behavior follows the SUSv3 specification.
In SUSv3, pthread_create() says "The alternate stack shall not be inherited
(when new threads are initialized)".  It means that sigaltstack should be
cleared when sigaltstack memory space is shared by cloned threads with
CLONE_SIGHAND.

Note that I chose "if (clone_flags & CLONE_SIGHAND)" line because:
  - If clone_flags line is not existed, fork() does not inherit sigaltstack.
  - CLONE_VM is another choice, but vfork() does not inherit sigaltstack.
  - CLONE_SIGHAND implies CLONE_VM, and it looks suitable.
  - CLONE_THREAD is another candidate, and includes CLONE_SIGHAND + CLONE_VM,
    but this flag has a bit different semantics.
I decided to use CLONE_SIGHAND.

[ Changed to test for CLONE_VM && !CLONE_VFORK after discussion --Linus ]

Signed-off-by: GOTO Masanori <gotom@sanori.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Linus Torvalds <torvalds@osdl.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-14 07:57:17 -08:00
Christoph Hellwig
7cd9013be6 [PATCH] remove __put_task_struct_cb export again
The patch '[PATCH] RCU signal handling' [1] added an export for
__put_task_struct_cb, a put_task_struct helper newly introduced in that
patch.  But the put_task_struct couldn't be used modular previously as
__put_task_struct wasn't exported.  There are not callers of it in modular
code, and it shouldn't be exported because we don't want drivers to hold
references to task_structs.

This patch removes the export and folds __put_task_struct into
__put_task_struct_cb as there's no other caller.

[1] http://www2.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e56d090310d7625ecb43a1eeebd479f04affb48b

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-11 09:19:34 -08:00
Paul Mackerras
5164501794 Merge ../linux-2.6 2006-03-09 14:32:05 +11:00
Dipankar Sarma
529bf6be5c [PATCH] fix file counting
I have benchmarked this on an x86_64 NUMA system and see no significant
performance difference on kernbench.  Tested on both x86_64 and powerpc.

The way we do file struct accounting is not very suitable for batched
freeing.  For scalability reasons, file accounting was
constructor/destructor based.  This meant that nr_files was decremented
only when the object was removed from the slab cache.  This is susceptible
to slab fragmentation.  With RCU based file structure, consequent batched
freeing and a test program like Serge's, we just speed this up and end up
with a very fragmented slab -

llm22:~ # cat /proc/sys/fs/file-nr
587730  0       758844

At the same time, I see only a 2000+ objects in filp cache.  The following
patch I fixes this problem.

This patch changes the file counting by removing the filp_count_lock.
Instead we use a separate percpu counter, nr_files, for now and all
accesses to it are through get_nr_files() api.  In the sysctl handler for
nr_files, we populate files_stat.nr_files before returning to user.

Counting files as an when they are created and destroyed (as opposed to
inside slab) allows us to correctly count open files with RCU.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08 14:14:01 -08:00
Dipankar Sarma
21a1ea9eb4 [PATCH] rcu batch tuning
This patch adds new tunables for RCU queue and finished batches.  There are
two types of controls - number of completed RCU updates invoked in a batch
(blimit) and monitoring for high rate of incoming RCUs on a cpu (qhimark,
qlowmark).

By default, the per-cpu batch limit is set to a small value.  If the input
RCU rate exceeds the high watermark, we do two things - force quiescent
state on all cpus and set the batch limit of the CPU to INTMAX.  Setting
batch limit to INTMAX forces all finished RCUs to be processed in one shot.
 If we have more than INTMAX RCUs queued up, then we have bigger problems
anyway.  Once the incoming queued RCUs fall below the low watermark, the
batch limit is set to the default.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08 14:14:01 -08:00
Ingo Molnar
81c29a857d [PATCH] idle threads should have a sane ->timestamp value
Idle threads should have a sane ->timestamp value, to avoid init kernel
thread(s) from inheriting it and causing miscalculations in
try_to_wake_up().

Reported-by: Mike Galbraith <efault@gmx.de>.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08 14:14:00 -08:00
Atsushi Nemoto
5aee405c66 [PATCH] time: add barrier after updating jiffies_64
Add a compiler barrier so that we don't read jiffies before updating
jiffies_64.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06 18:40:44 -08:00
Tony Lindgren
69239749e1 [PATCH] fix next_timer_interrupt() for hrtimer
Also from Thomas Gleixner <tglx@linutronix.de>

Function next_timer_interrupt() got broken with a recent patch
6ba1b91213 as sys_nanosleep() was moved to
hrtimer.  This broke things as next_timer_interrupt() did not check hrtimer
tree for next event.

Function next_timer_interrupt() is needed with dyntick (CONFIG_NO_IDLE_HZ,
VST) implementations, as the system can be in idle when next hrtimer event
was supposed to happen.  At least ARM and S390 currently use
next_timer_interrupt().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06 18:40:44 -08:00
Linus Torvalds
8ba7b0a14b Add early-boot-safety check to cond_resched()
Just to be safe, we should not trigger a conditional reschedule during
the early boot sequence.  We've historically done some questionable
early on, and the safety warnings in __might_sleep() are generally
turned off during that period, so there might be problems lurking.

This affects CONFIG_PREEMPT_VOLUNTARY, which takes over might_sleep() to
cause a voluntary conditional reschedule.

Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06 17:38:49 -08:00
Christoph Lameter
685db65e42 [PATCH] time_interpolator: Use readq_relaxed() instead of readq().
On some platforms readq performs additional work to make sure I/O is done
in a coherent way.  This is not needed for time retrieval as done by the
time interpolator.  So we can use readq_relaxed instead which will improve
performance.

It affects sparc64 and ia64 only.  Apparently it makes a significant
difference on ia64.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: john stultz <johnstul@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-02 08:33:07 -08:00
Stefan Seyfried
7f99f06f01 [PATCH] fix acpi_video_flags on x86-64
acpi_video_flags variable is unsigned long, so it should be set as such.
This actually matters on x86-64.

Signed-off-by: Stefan Seyfried <seife@suse.de>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: "Brown, Len" <len.brown@intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-02 08:33:07 -08:00
Jes Sorensen
d2b176ed87 [IA64] sysctl option to silence unaligned trap warnings
Allow sysadmin to disable all warnings about userland apps
making unaligned accesses by using:
 # echo 1 > /proc/sys/kernel/ignore-unaligned-usertrap
Rather than having to use prctl on a process by process basis.

Default behaivour leaves the warnings enabled.

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-02-28 09:42:23 -08:00
James Bottomley
1fa44ecad2 [SCSI] add execute_in_process_context() API
We have several points in the SCSI stack (primarily for our device
functions) where we need to guarantee process context, but (given the
place where the last reference was released) we cannot guarantee this.

This API gets around the issue by executing the function directly if
the caller has process context, but scheduling a workqueue to execute
in process context if the caller doesn't have it.

Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
2006-02-27 23:34:40 -06:00
Paul Mackerras
a00428f5b1 Merge ../powerpc-merge 2006-02-24 14:05:47 +11:00
Bjrn Steinbrink
5914811acf [PATCH] kjournald keeps reference to namespace
In daemonize() a new thread gets cleaned up and 'merged' with init_task.
The current fs_struct is handled there, but not the current namespace.

This adds the namespace part.

[ Eric Biederman pointed out the namespace wrappers, and also notes that
  we can't ever count on using our parents namespace because we already
  have called exit_fs(), which is the only way to the namespace from a
  process. ]

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20 20:27:38 -08:00
Linus Torvalds
cf70a6f264 Merge branch 'fixes.b8' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/bird 2006-02-20 20:09:44 -08:00
Stephen Rothwell
7fd105e758 [PATCH] Fix compile for CONFIG_SYSVIPC=n or CONFIG_SYSCTL=n
The compat syscalls are added to sys_ni.c since they are not defined if the
above CONFIG options are off.  Also, nfs would not build with CONFIG_SYSCTL
off.

Noticed by Arthur Othieno.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20 20:00:11 -08:00
Luke Yang
7a9166e3b0 [PATCH] Fix undefined symbols for nommu architecture
Signed-off-by: Luke Yang <luke.adi@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20 20:00:11 -08:00
Pavel Machek
c255d844dd [PATCH] suspend-to-ram: allow video options to be set at runtime
Currently, acpi video options can only be set on kernel command line.  That's
little inflexible; I'd like userland s2ram application that just works, and
modifying kernel command line according to whitelist is not fun.  It is better
to just allow s2ram application to set video options just before suspend
(according to the whitelist).

This implements sysctl to allow setting suspend video options without reboot.

(akpm: Documentation updates for this new sysctl are pending..)

Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: "Brown, Len" <len.brown@intel.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-20 20:00:10 -08:00
Al Viro
ef20c8c197 [PATCH] GFP_KERNEL allocations in atomic (auditsc)
audit_log_exit() is called from atomic contexts and gets explicit
gfp_mask argument; it should use it for all allocations rather
than doing some with gfp_mask and some with GFP_KERNEL.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-18 15:41:50 -05:00
Rafael J. Wysocki
a8534adb74 [PATCH] swsusp: fix breakage with swap on LVM
Restore the compatibility with the older code and make it possible to
suspend if the kernel command line doesn't contain the "resume=" argument

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17 13:59:27 -08:00
Ingo Molnar
4bbf39c29b [PATCH] Introduce CONFIG_DEFAULT_MIGRATION_COST
Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

  The boot sequence on s390 sometimes takes ages and we spend a very long
  time (up to one or two minutes) in calibrate_migration_costs.  The time
  spent there differs from boot to boot.  Also the calculated costs differ
  a lot.  I've seen differences by up to a factor of 15 (yes, factor not
  percent).  Also I doubt that making these measurements make much sense on
  a completely virtualized architecture where you cannot tell how much cpu
  time you will get anyway.

So introduce the CONFIG_DEFAULT_MIGRATION_COST method for an architecture
to set the scheduler migration costs.  This turns off automatic detection
of migration costs.  Makes sense on virtual platforms, where migration
costs are hard to measure accurately.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17 13:59:26 -08:00
Paul Mackerras
726c14bf49 [PATCH] Provide an interface for getting the current tick length
This provides an interface for arch code to find out how many
nanoseconds are going to be added on to xtime by the next call to
do_timer.  The value returned is a fixed-point number in 52.12 format
in nanoseconds.  The reason for this format is that it gives the
full precision that the timekeeping code is using internally.

The motivation for this is to fix a problem that has arisen on 32-bit
powerpc in that the value returned by do_gettimeofday drifts apart
from xtime if NTP is being used.  PowerPC is now using a lockless
do_gettimeofday based on reading the timebase register and performing
some simple arithmetic.  (This method of getting the time is also
exported to userspace via the VDSO.)  However, the factor and offset
it uses were calculated based on the nominal tick length and weren't
being adjusted when NTP varied the tick length.

Note that 64-bit powerpc has had the lockless do_gettimeofday for a
long time now.  It also had an extremely hairy routine that got called
from the 32-bit compat routine for adjtimex, which adjusted the
factor and offset according to what it thought the timekeeping code
was going to do.  Not only was this only called if a 32-bit task did
adjtimex (i.e. not if a 64-bit task did adjtimex), it was also
duplicating computations from kernel/timer.c and it wasn't clear that
it was (still) correct.

The simple solution is to ask the timekeeping code how long the
current jiffy will be on each timer interrupt, after calling
do_timer.  If this jiffy will be a different length from the last one,
we then need to compute new values for the factor and offset used in
the lockless do_gettimeofday.  In this way we can keep xtime and
do_gettimeofday in sync, even when NTP is varying the tick length.

Note that when adjtimex varies the tick length, it almost always
introduces the variation from the next tick on.  The only case I could
see where adjtimex would vary the length of the current tick is when
an old-style adjtime adjustment is being cancelled.  (It's not clear
to me why the adjustment has to be cancelled immediately rather than
from the next tick on.)  Thus I don't see any real need for a hook in
adjtimex; the rare case of an old-style adjustment being cancelled can
be fixed up at the next tick.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: john stultz <johnstul@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17 08:24:29 -08:00
Andi Kleen
a62eaf151d [PATCH] x86_64: Add boot option to disable randomized mappings and cleanup
AMD SimNow!'s JIT doesn't like them at all in the guest. For distribution
installation it's easiest if it's a boot time option.

Also I moved the variable to a more appropiate place and make
it independent from sysctl

And marked __read_mostly which it is.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-17 08:00:40 -08:00
Andrew Morton
c8adb494a6 [PATCH] swsusp: nuke noisy message
I get about 88 squillion of these when suspending an old ad450nx server.

Cc: Pavel Roskin <proski@gnu.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15 15:32:22 -08:00
Paul Jackson
06fed33849 [PATCH] cpuset: oops in exit on null cpuset fix
Fix a latent bug in cpuset_exit() handling.  If a task tried to allocate
memory after calling cpuset_exit(), it oops'd in
cpuset_update_task_memory_state() on a NULL cpuset pointer.

So set the exiting tasks cpuset to the root cpuset instead of to NULL.

A distro kernel hit this with an added kernel package that had just such a
hook (allocating memory) in the exit code path.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15 15:32:21 -08:00
Oleg Nesterov
5ecfbae093 [PATCH] fix zap_thread's ptrace related problems
1. The tracee can go from ptrace_stop() to do_signal_stop()
   after __ptrace_unlink(p).

2. It is unsafe to __ptrace_unlink(p) while p->parent may wait
   for tasklist_lock in ptrace_detach().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15 11:05:43 -08:00
Oleg Nesterov
dadac81b1b [PATCH] fix kill_proc_info() vs fork() theoretical race
copy_process:

	attach_pid(p, PIDTYPE_PID, p->pid);
	attach_pid(p, PIDTYPE_TGID, p->tgid);

What if kill_proc_info(p->pid) happens in between?

copy_process() holds current->sighand.siglock, so we are safe
in CLONE_THREAD case, because current->sighand == p->sighand.

Otherwise, p->sighand is unlocked, the new process is already
visible to the find_task_by_pid(), but have a copy of parent's
'struct pid' in ->pids[PIDTYPE_TGID].

This means that __group_complete_signal() may hang while doing

	do ... while (next_thread() != p)

We can solve this problem if we reverse these 2 attach_pid()s:

	attach_pid() does wmb()

	group_send_sig_info() calls spin_lock(), which
	provides a read barrier. // Yes ?

I don't think we can hit this race in practice, but still.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15 10:21:24 -08:00
Oleg Nesterov
3f17da6994 [PATCH] fix kill_proc_info() vs CLONE_THREAD race
There is a window after copy_process() unlocks ->sighand.siglock
and before it adds the new thread to the thread list.

In that window __group_complete_signal(SIGKILL) will not see the
new thread yet, so this thread will start running while the whole
thread group was supposed to exit.

I beleive we have another good reason to place attach_pid(PID/TGID)
under ->sighand.siglock. We can do the same for

	release_task()->__unhash_process()

	de_thread()->switch_exec_pids()

After that we don't need tasklist_lock to iterate over the thread
list, and we can simplify things, see for example do_sigaction()
or sys_times().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15 10:21:23 -08:00
Ingo Molnar
06027bdd27 [PATCH] hrtimer: round up relative start time on low-res arches
CONFIG_TIME_LOW_RES is a temporary way for architectures to signal that
they simply return xtime in do_gettimeoffset().  In this corner-case we
want to round up by resolution when starting a relative timer, to avoid
short timeouts.  This will go away with the GTOD framework.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14 16:09:35 -08:00
Chen, Kenneth W
d6077cb80c [PATCH] sched: revert "filter affine wakeups"
Revert commit d7102e95b7:

    [PATCH] sched: filter affine wakeups

Apparently caused more than 10% performance regression for aim7 benchmark.
The setup in use is 16-cpu HP rx8620, 64Gb of memory and 12 MSA1000s with 144
disks.  Each disk is 72Gb with a single ext3 filesystem (courtesy of HP, who
supplied benchmark results).

The problem is, for aim7, the wake-up pattern is random, but it still needs
load balancing action in the wake-up path to achieve best performance.  With
the above commit, lack of load balancing hurts that workload.

However, for workloads like database transaction processing, the requirement
is exactly opposite.  In the wake up path, best performance is achieved with
absolutely zero load balancing.  We simply wake up the process on the CPU that
it was previously run.  Worst performance is obtained when we do load
balancing at wake up.

There isn't an easy way to auto detect the workload characteristics.  Ingo's
earlier patch that detects idle CPU and decide whether to load balance or not
doesn't perform with aim7 either since all CPUs are busy (it causes even
bigger perf.  regression).

Revert commit d7102e95b7, which causes more
than 10% performance regression with aim7.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14 16:09:34 -08:00
Hugh Dickins
16bf134840 [PATCH] compound page: no access_process_vm check
The PageCompound check before access_process_vm's set_page_dirty_lock is no
longer necessary, so remove it.  But leave the PageCompound checks in
bio_set_pages_dirty, dio_bio_complete and nfs_free_user_pages: at least some
of those were introduced as a little optimization on hugetlb pages.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14 16:09:33 -08:00
Jan Beulich
c22db94127 [PATCH] prevent recursive panic from softlockup watchdog
When panic_timeout is zero, suppress triggering a nested panic due to soft
lockup detection.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-10 08:13:12 -08:00
Nick Piggin
a2000572ad [PATCH] sched: remove smpnice
I don't think the code is quite ready, which is why I asked for Peter's
additions to also be merged before I acked it (although it turned out that
it still isn't quite ready with his additions either).

Basically I have had similar observations to Suresh in that it does not
play nicely with the rest of the balancing infrastructure (and raised
similar concerns in my review).

The samples (group of 4) I got for "maximum recorded imbalance" on a 2x2
SMP+HT Xeon are as follows:

            | Following boot | hackbench 20        | hackbench 40
 -----------+----------------+---------------------+---------------------
 2.6.16-rc2 | 30,37,100,112  | 5600,5530,6020,6090 | 6390,7090,8760,8470
 +nosmpnice |  3, 2,  4,  2  |   28, 150, 294, 132 |  348, 348, 294, 347

Hackbench raw performance is down around 15% with smpnice (but that in
itself isn't a huge deal because it is just a benchmark).  However, the
samples show that the imbalance passed into move_tasks is increased by
about a factor of 10-30.  I think this would also go some way to explaining
latency blips turning up in the balancing code (though I haven't actually
measured that).

We'll probably have to revert this in the SUSE kernel.

Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: "Martin J. Bligh" <mbligh@aracnet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-10 08:13:11 -08:00
Jon Mason
2ef9481e66 [PATCH] powerpc: trivial: modify comments to refer to new location of files
This patch removes all self references and fixes references to files
in the now defunct arch/ppc64 tree.  I think this accomplises
everything wanted, though there might be a few references I missed.

Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-10 16:53:51 +11:00
Oleg Nesterov
9ac95f2f90 [PATCH] do_sigaction: cleanup ->sa_mask manipulation
Clear unblockable signals beforehand.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-09 16:17:36 -08:00
Oleg Nesterov
c70d3d703a [PATCH] sys_signal: initialize ->sa_mask
Pointed out by Linus Torvalds.

sys_signal() forgets to initialize ->sa_mask.

( I suspect arch/ia64/ia32/ia32_signal.c:sys32_signal()
  also needs this fix )

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-09 16:17:36 -08:00
Al Viro
4bb8089c86 [PATCH] kernel/sys.c NULL noise removal
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-07 20:57:47 -05:00
Al Viro
53f087febf [PATCH] timer.c NULL noise removal
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-07 20:57:42 -05:00
Al Viro
1b8623545b [PATCH] remove bogus asm/bug.h includes.
A bunch of asm/bug.h includes are both not needed (since it will get
pulled anyway) and bogus (since they are done too early).  Removed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-07 20:56:35 -05:00
JANAK DESAI
a016f3389c [PATCH] unshare system call -v5: unshare files
If the file descriptor structure is being shared, allocate a new one and copy
information from the current, shared, structure.

Signed-off-by: Janak Desai <janak@us.ibm.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:34 -08:00
JANAK DESAI
a0a7ec308f [PATCH] unshare system call -v5: unshare vm
If vm structure is being shared, allocate a new one and copy information from
the current, shared, structure.

Signed-off-by: Janak Desai <janak@us.ibm.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:34 -08:00
JANAK DESAI
741a295130 [PATCH] unshare system call -v5: unshare namespace
If the namespace structure is being shared, allocate a new one and copy
information from the current, shared, structure.

Signed-off-by: Janak Desai <janak@us.ibm.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:34 -08:00
JANAK DESAI
99d1419d96 [PATCH] unshare system call -v5: unshare filesystem info
If filesystem structure is being shared, allocate a new one and copy
information from the current, shared, structure.

Signed-off-by: Janak Desai <janak@us.ibm.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:34 -08:00
JANAK DESAI
cf2e340f42 [PATCH] unshare system call -v5: system call handler function
sys_unshare system call handler function accepts the same flags as clone
system call, checks constraints on each of the flags and invokes corresponding
unshare functions to disassociate respective process context if it was being
shared with another task.

Here is the link to a program for testing unshare system call.

http://prdownloads.sourceforge.net/audit/unshare_test.c?download

Please note that because of a problem in rmdir associated with bind mounts and
clone with CLONE_NEWNS, the test fails while trying to remove temporary test
directory.  You can remove that temporary directory by doing rmdir, twice,
from the command line.  The first will fail with EBUSY, but the second will
succeed.  I have reported the problem to Ram Pai and Al Viro with a small
program which reproduces the problem.  Al told us yesterday that he will be
looking at the problem soon.  I have tried multiple rmdirs from the
unshare_test program itself, but for some reason that is not working.  Doing
two rmdirs from command line does seem to remove the directory.

Signed-off-by: Janak Desai <janak@us.ibm.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:34 -08:00
Rafael J. Wysocki
46cd2f32ba [PATCH] Fix build failure in recent pm_prepare_* changes.
Fix compilation problem in PM headers.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:33 -08:00
Andrew Morton
8e08b75686 [PATCH] module: strlen_user() race fix
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:32 -08:00
Pavel Machek
7714d5985b [PATCH] swsusp: kill unneeded/unbalanced bio_get
- Remove unneeded bio_get() which would cause a bio leak

- Writing doesn't dirty pages.  Reading dirties pages.

- We should dirty the pages after the IO completion, not before

(Busy-waiting for disk I/O completion isn't very polite.)

Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-07 16:12:31 -08:00
Dave Jones
5c0d5d262a [PATCH] missing license tag in intermodule
It may suck something awful, but it shouldn't taint the kernel.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:52 -08:00
Chuck Ebbert
bd576c9523 [PATCH] sched: only print migration_cost once per boot
migration_cost prints after every CPU hotplug event.  Make it print only
once at boot.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:51 -08:00
Eric Dumazet
88a2a4ac6b [PATCH] percpu data: only iterate over possible CPUs
percpu_data blindly allocates bootmem memory to store NR_CPUS instances of
cpudata, instead of allocating memory only for possible cpus.

As a preparation for changing that, we need to convert various 0 -> NR_CPUS
loops to use for_each_cpu().

(The above only applies to users of asm-generic/percpu.h.  powerpc has gone it
alone and is presently only allocating memory for present CPUs, so it's
currently corrupting memory).

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@suse.de>
Cc: Anton Blanchard <anton@samba.org>
Acked-by: William Irwin <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-05 11:06:51 -08:00
Andrew Morton
514a01b880 [PATCH] uninline __sigqueue_free()
Five callsites.  I dunno how all this crap got back in there :(

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-03 08:32:10 -08:00
Randy Dunlap
fe85a998ca [PATCH] cpuset: fix sparse warning
kernel/cpuset.c:644:38: warning: non-ANSI function declaration of function 'cpuset_update_task_memory_state'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-03 08:32:06 -08:00
George Anzinger
88fc3897e3 [PATCH] Normalize timespec for negative values in ns_to_timespec
- In case of a negative nsec value the result of the division must be
  normalized.

- Remove inline from an exported function.

Signed-off-by: George Anzinger <george@wildturkeyranch.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-03 08:32:06 -08:00
Keith Owens
54e8ce463a [PATCH] Tell kallsyms_lookup_name() to ignore type U entries
When one module exports a function symbol and another module uses that
symbol then kallsyms shows the symbol twice.  Once from the consumer with a
type of 'U' and once from the provider with a type of 't' or 'T'.  On most
architectures, both entries have the same address so it does not matter
which one is returned by kallsyms_lookup_name().  But on architectures with
function descriptors, the 'U' entry points to the descriptor, not to the
code body, which is not what we want.

IA64 # grep -w qla2x00_remove_one /proc/kallsyms
a000000208c25ef8 U qla2x00_remove_one   [qla2300]   <= descriptor
a000000208bf44c0 t qla2x00_remove_one   [qla2xxx]   <= function body

Tell kallsyms_lookup_name() to ignore type U entries in modules.

Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-03 08:32:02 -08:00
Ananth N Mavinakayanahalli
278ff95370 [PATCH] Kprobes: Fix deadlock in function-return probes
When two function-return probes are inserted on kfree()[1] and the second
on say, sys_link()[2], and later [2] is unregistered, we have a deadlock as
kfree is called with the kretprobe_lock held and the function-return probe
on kfree will also try to grab the same lock.

However, we can move the kfree() during unregistration to outside the
spinlock as we are sure that no instances from the free list will be used
after synchronized_sched() returns during the unregistration process.
Thanks to Masami Hiramatsu for spotting this.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-03 08:32:00 -08:00
Adrian Bunk
e65cefe87b [PATCH] kernel/kprobes.c: fix a warning #ifndef ARCH_SUPPORTS_KRETPROBES
kernel/kprobes.c:353: warning: 'pre_handler_kretprobe' defined but not used

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-03 08:32:00 -08:00
Linus Torvalds
59ed2f59e4 Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 2006-02-01 22:06:15 -08:00
Christoph Lameter
2a11ff06d7 [PATCH] zone_reclaim: configurable off node allocation period.
Currently the zone_reclaim code has a fixed window of 30 seconds of off node
allocations should a local zone have no unused pagecache pages left.  Reclaim
will be attempted again after this timeout period to avoid repeated useless
scans for memory.  This is also useful to established sufficiently large off
node allocation chunks to relieve the local node.

It may be beneficial to adjust that time period for some special situations.
For example if memory use was exceeding node capacity one may want to give up
for longer periods of time.  If memory spikes intermittendly then one may want
to shorten the time period to reduce the number of off node allocations.

This patch allows just that....

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:16 -08:00
Christoph Lameter
c84db23c6e [PATCH] zone_reclaim: minor fixes
- If we only reclaim nr_pages then its okay to stay on node.
  Switch from > to >= for the comparison.

- vm_table[] entry for zone_reclaim_mode is a bit screwed up.

- Add empty lines around shrink_zone to show that this is the
  central function to be called.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:15 -08:00
Rafael J. Wysocki
f7b8988ff5 [PATCH] swsusp: do not change log level during suspend/resume
Prevent the kernel from setting the log level to 10 unconditionally during
suspend/resume which was needed in the past for debugging, but generally is
undesirable.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:14 -08:00
Jack Steiner
2f7016d917 [PATCH] sys_sched_getaffinity() & hotplug
Change sched_getaffinity() so that it returns a bitmap that indicates the
legally schedulable cpus that a task is allowed to run on.

Without this patch, if CONFIG_HOTPLUG_CPU is enabled, sched_getaffinity()
unconditionally returns (at least on IA64) a mask with NR_CPUS bits set.
This conveys no useful infornmation except for a kernel compile option.

This fixes a breakage we obseved running recent kernels. We have MPI jobs
that use sched_getaffinity() to determine where to place their threads.
Placing them on non-existant cpus is problematic :-)

Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nathan Lynch <ntl@pobox.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:13 -08:00
Adrian Bunk
493f01d1d0 [PATCH] kernel/posix-timers.c: remove do_posix_clock_notimer_create()
This function is neither used nor has any real contents.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:13 -08:00
Thomas Gleixner
952bbc87f0 [PATCH] hrtimers: set correct initial expiry time for relative SIGEV_NONE timers
The expiry time for relative timers with SIGEV_NONE set was never
updated to the correct value.

Pointed out by George Anzinger.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:13 -08:00
Thomas Gleixner
66188fae3b [PATCH] hrtimers: add back lost credit lines
At some point we added credits to people who actively helped to bring
k/hr-timers along.  This was lost in the big code revamp.  Add it back.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:13 -08:00
George Anzinger
7978672c4d [PATCH] hrtimers: cleanups and simplifications
Clean up the interface to hrtimers by changing the init code to pass the mode
as well as the clock.  This allow the init code to select the correct base and
eliminates extra timer re-init code in posix-timers.  We also simplify the
restart interface nanosleep use.

Signed-off-by: George Anzinger <george@mvista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:13 -08:00
akpm@osdl.org
ff60a5dc4f [PATCH] hrtimers: fix posix-timer requeue race
From: Steven Rostedtrostedt@goodmis.org <rostedt@goodmis.org>

CPU0 expires a posix-timer and runs the callback function.  The signal is
queued.

After releasing the posix-timer lock and before returning to hrtimer_run_queue
CPU0 gets interrupted.  CPU1 delivers the queued signal and rearms the timer.
CPU0 comes back to hrtimer_run_queue and sets the timer state to expired.

The next modification of the timer can result in an oops, because the state
information is wrong.

Keep track of state = RUNNING and check if the state has been in the return
path of hrtimer_run_queue.  In case the state has been changed, ignore a
restart request and do not touch the state variable.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:13 -08:00
Thomas Gleixner
a16a1c095a [PATCH] hrtimers: fix oldvalue return in setitimer
This resolves bugzilla bug#5617.  The oldvalue of the timer was read after the
timer was cancelled, so the remaining time was always zero.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:12 -08:00
Thomas Gleixner
b6557fbca8 [PATCH] hrtimers: fix possible use of NULL pointer in posix-timers
Fixup the conversion of posix-timers to hrtimers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:12 -08:00
Thomas Gleixner
bc1978d404 [PATCH] hrtimers: fixup itimer conversion
The itimer conversion removed the locking which protects the timer and
variables in the shared signal structure.  Steven Rostedt found the problem in
the latest -rt patches.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:12 -08:00
Rafael J. Wysocki
853609b61e [PATCH] swsusp: use bytes as image size units
Make swsusp use bytes as the image size units, which is needed for future
compatibility.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-01 08:53:12 -08:00
Andrew Morton
3fa97c9db4 [PATCH] "Fix uidhash_lock <-> RXU deadlock" fix
I get storms of warnings from local_bh_enable().  Better-tested patches,
please.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-31 16:49:43 -08:00
Ingo Molnar
adac166523 [PATCH] rcu_torture_lock deadlock fix
rcu_torture_lock is used in a softirq-unsafe manner, but it is also
taken by rcu_torture_cb(), which may execute in softirq-context,
resulting in potential deadlocks.

The fix is to acquire rcu_torture_lock in a softirq-safe manner.  With
this fix applied, the rcu-torture code passes validation.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-31 11:30:18 -08:00
Ingo Molnar
4021cb279a [PATCH] fix uidhash_lock <-> RCU deadlock
RCU task-struct freeing can call free_uid(), which is taking
uidhash_lock - while other users of uidhash_lock are softirq-unsafe.

The fix is to always take the uidhash_spinlock in a softirq-safe manner.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-31 11:30:18 -08:00
Ingo Molnar
70b4d63e98 [PATCH] Fix boot-time slowdown for measure_migration_cost
This reduces the amount of time the migration cost calculations cost
during bootup. Based on numbers by Tony Luck <tony.luck@intel.com>.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2006-01-31 10:23:31 -08:00
Linus Torvalds
951069e311 Don't try to "validate" a non-existing timeval.
settime() with a NULL timeval is silly but legal.

Noticed by Dave Jones <davej@redhat.com>

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-31 10:16:55 -08:00
Len Brown
9fdb62af92 [ACPI] merge 3549 4320 4485 4588 4980 5483 5651 acpica asus fops pnpacpi branches into release
Signed-off-by: Len Brown <len.brown@intel.com>
2006-01-24 17:52:48 -05:00
Alan Cox
715b49ef2d [PATCH] EDAC: atomic scrub operations
EDAC requires a way to scrub memory if an ECC error is found and the chipset
does not do the work automatically.  That means rewriting memory locations
atomically with respect to all CPUs _and_ bus masters.  That means we can't
use atomic_add(foo, 0) as it gets optimised for non-SMP

This adds a function to include/asm-foo/atomic.h for the platforms currently
supported which implements a scrub of a mapped block.

It also adjusts a few other files include order where atomic.h is included
before types.h as this now causes an error as atomic_scrub uses u32.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:30 -08:00
David Woodhouse
150256d8aa [PATCH] Generic sys_rt_sigsuspend()
The TIF_RESTORE_SIGMASK flag allows us to have a generic implementation of
sys_rt_sigsuspend() instead of duplicating it for each architecture.  This
provides such an implementation and makes arch/powerpc use it.

It also tidies up the ppc32 sys_sigsuspend() to use TIF_RESTORE_SIGMASK.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:29 -08:00
Jason Baron
c21761f168 [PATCH] fix sched_setscheduler semantics
Currently, a negative policy argument passed into the
'sys_sched_setscheduler()' system call, will return with success.  However,
the manpage for 'sys_sched_setscheduler' says:

EINVAL The scheduling policy is not one of the recognized policies, or the
              parameter p does not make sense for the policy.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:22 -08:00
Christoph Lameter
1743660b91 [PATCH] Zone reclaim: proc override
proc support for zone reclaim

This patch creates a proc entry /proc/sys/vm/zone_reclaim_mode that may be
used to override the automatic determination of the zone reclaim made on
bootup.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:17 -08:00
Ingo Molnar
ea13dbc89c [PATCH] kernel/hrtimer.c sparse warning fix
fix the following sparse warning:

 kernel/hrtimer.c:665:34: warning: incorrect type in argument 2 (different address spaces)
 kernel/hrtimer.c:665:34:    expected void const *from
 kernel/hrtimer.c:665:34:    got struct timespec [noderef] *<noident><asn:1>
 kernel/hrtimer.c:664:2: warning: dereference of noderef expression

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-16 23:21:12 -08:00
Adrian Bunk
fd279197b1 [PATCH] build kernel/intermodule.c only when required
Build kernel/intermodule.c only when required.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-16 23:15:26 -08:00
Jonathan Corbet
8dca6f33f0 [PATCH] hrtimer comment tweak
Fix a comment which missed an update cycle somewhere.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-16 20:27:03 -08:00
Linus Torvalds
3f02d072d4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial 2006-01-15 16:43:29 -08:00
Paul Jackson
505970b96e [PATCH] cpuset oom lock fix
The problem, reported in:

  http://bugzilla.kernel.org/show_bug.cgi?id=5859

and by various other email messages and lkml posts is that the cpuset hook
in the oom (out of memory) code can try to take a cpuset semaphore while
holding the tasklist_lock (a spinlock).

One must not sleep while holding a spinlock.

The fix seems easy enough - move the cpuset semaphore region outside the
tasklist_lock region.

This required a few lines of mechanism to implement.  The oom code where
the locking needs to be changed does not have access to the cpuset locks,
which are internal to kernel/cpuset.c only.  So I provided a couple more
cpuset interface routines, available to the rest of the kernel, which
simple take and drop the lock needed here (cpusets callback_sem).

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14 18:27:10 -08:00
Martin Schwidefsky
0152fb3760 [PATCH] s390: spinlock fixes
Remove useless spin_retry_counter and fix compilation for UP kernels.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14 18:27:09 -08:00
Arjan van de Ven
858119e159 [PATCH] Unlinline a bunch of other functions
Remove the "inline" keyword from a bunch of big functions in the kernel with
the goal of shrinking it by 30kb to 40kb

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14 18:27:06 -08:00
Ingo Molnar
b0a9499c3d [PATCH] sched: add new SCHED_BATCH policy
Add a new SCHED_BATCH (3) scheduling policy: such tasks are presumed
CPU-intensive, and will acquire a constant +5 priority level penalty.  Such
policy is nice for workloads that are non-interactive, but which do not
want to give up their nice levels.  The policy is also useful for workloads
that want a deterministic scheduling policy without interactivity causing
extra preemptions (between that workload's tasks).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14 18:25:20 -08:00
Christian Kujau
624dffcbcf correct email address of Manfred Spraul
I  tried to send the forcedeth maintainer an email, but it came back with:

"The mail address manfreds@colorfullife.com is not read anymore.
Please resent your mail to manfred@ instead of manfreds@."

This patch fixes this.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-01-15 02:43:54 +01:00
Adrian Bunk
750c902ef4 SOFTWARE_SUSPEND: fix a typo in the dependencies
This patch fixes a typo in the dependencies of SOFTWARE_SUSPEND.

This patch is based on a report by
Jean-Luc Leger <reiga@dspnet.fr.eu.org>.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
2006-01-15 02:01:39 +01:00
Linus Torvalds
661dd5c840 Merge master.kernel.org:/pub/scm/linux/kernel/git/tglx/hrtimer-2.6 2006-01-12 10:22:11 -08:00
akpm@osdl.org
d7102e95b7 [PATCH] sched: filter affine wakeups
)

From: Nick Piggin <nickpiggin@yahoo.com.au>

Track the last waker CPU, and only consider wakeup-balancing if there's a
match between current waker CPU and the previous waker CPU.  This ensures
that there is some correlation between two subsequent wakeup events before
we move the task.  Should help random-wakeup workloads on large SMP
systems, by reducing the migration attempts by a factor of nr_cpus.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 09:08:50 -08:00
akpm@osdl.org
198e2f1811 [PATCH] scheduler cache-hot-autodetect
)

From: Ingo Molnar <mingo@elte.hu>

This is the latest version of the scheduler cache-hot-auto-tune patch.

The first problem was that detection time scaled with O(N^2), which is
unacceptable on larger SMP and NUMA systems. To solve this:

- I've added a 'domain distance' function, which is used to cache
  measurement results. Each distance is only measured once. This means
  that e.g. on NUMA distances of 0, 1 and 2 might be measured, on HT
  distances 0 and 1, and on SMP distance 0 is measured. The code walks
  the domain tree to determine the distance, so it automatically follows
  whatever hierarchy an architecture sets up. This cuts down on the boot
  time significantly and removes the O(N^2) limit. The only assumption
  is that migration costs can be expressed as a function of domain
  distance - this covers the overwhelming majority of existing systems,
  and is a good guess even for more assymetric systems.

  [ People hacking systems that have assymetries that break this
    assumption (e.g. different CPU speeds) should experiment a bit with
    the cpu_distance() function. Adding a ->migration_distance factor to
    the domain structure would be one possible solution - but lets first
    see the problem systems, if they exist at all. Lets not overdesign. ]

Another problem was that only a single cache-size was used for measuring
the cost of migration, and most architectures didnt set that variable
up. Furthermore, a single cache-size does not fit NUMA hierarchies with
L3 caches and does not fit HT setups, where different CPUs will often
have different 'effective cache sizes'. To solve this problem:

- Instead of relying on a single cache-size provided by the platform and
  sticking to it, the code now auto-detects the 'effective migration
  cost' between two measured CPUs, via iterating through a wide range of
  cachesizes. The code searches for the maximum migration cost, which
  occurs when the working set of the test-workload falls just below the
  'effective cache size'. I.e. real-life optimized search is done for
  the maximum migration cost, between two real CPUs.

  This, amongst other things, has the positive effect hat if e.g. two
  CPUs share a L2/L3 cache, a different (and accurate) migration cost
  will be found than between two CPUs on the same system that dont share
  any caches.

(The reliable measurement of migration costs is tricky - see the source
for details.)

Furthermore i've added various boot-time options to override/tune
migration behavior.

Firstly, there's a blanket override for autodetection:

	migration_cost=1000,2000,3000

will override the depth 0/1/2 values with 1msec/2msec/3msec values.

Secondly, there's a global factor that can be used to increase (or
decrease) the autodetected values:

	migration_factor=120

will increase the autodetected values by 20%. This option is useful to
tune things in a workload-dependent way - e.g. if a workload is
cache-insensitive then CPU utilization can be maximized by specifying
migration_factor=0.

I've tested the autodetection code quite extensively on x86, on 3
P3/Xeon/2MB, and the autodetected values look pretty good:

Dual Celeron (128K L2 cache):

 ---------------------
 migration cost matrix (max_cache_size: 131072, cpu: 467 MHz):
 ---------------------
           [00]    [01]
 [00]:     -     1.7(1)
 [01]:   1.7(1)    -
 ---------------------
 cacheflush times [2]: 0.0 (0) 1.7 (1784008)
 ---------------------

Here the slow memory subsystem dominates system performance, and even
though caches are small, the migration cost is 1.7 msecs.

Dual HT P4 (512K L2 cache):

 ---------------------
 migration cost matrix (max_cache_size: 524288, cpu: 2379 MHz):
 ---------------------
           [00]    [01]    [02]    [03]
 [00]:     -     0.4(1)  0.0(0)  0.4(1)
 [01]:   0.4(1)    -     0.4(1)  0.0(0)
 [02]:   0.0(0)  0.4(1)    -     0.4(1)
 [03]:   0.4(1)  0.0(0)  0.4(1)    -
 ---------------------
 cacheflush times [2]: 0.0 (33900) 0.4 (448514)
 ---------------------

Here it can be seen that there is no migration cost between two HT
siblings (CPU#0/2 and CPU#1/3 are separate physical CPUs). A fast memory
system makes inter-physical-CPU migration pretty cheap: 0.4 msecs.

8-way P3/Xeon [2MB L2 cache]:

 ---------------------
 migration cost matrix (max_cache_size: 2097152, cpu: 700 MHz):
 ---------------------
           [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
 [00]:     -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
 [01]:  19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
 [02]:  19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
 [03]:  19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1)
 [04]:  19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1)
 [05]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1)
 [06]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1)
 [07]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -
 ---------------------
 cacheflush times [2]: 0.0 (0) 19.2 (19281756)
 ---------------------

This one has huge caches and a relatively slow memory subsystem - so the
migration cost is 19 msecs.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: <wilder@us.ibm.com>
Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 09:08:50 -08:00
Thomas Gleixner
c9db4fa115 [hrtimer] Enforce resolution as lower limit of intervals
Roman Zippel pointed out that the missing lower limit of intervals
leads to an accounting error in the overrun count. Enforce the lower
limit of intervals to resolution in the timer forwarding code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2006-01-12 11:47:34 +01:00
Thomas Gleixner
e2787630c1 [hrtimer] Change resolution storage to ktime_t format
Change the storage format of the per base resolution to ktime_t to
make it easier accessible in the hrtimers code.

Change the resolution from (NSEC_PER_SEC/HZ) to TICK_NSEC as Roman
pointed out. TICK_NSEC is closer to the real resolution.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2006-01-12 11:36:14 +01:00
Thomas Gleixner
288867ec5c [hrtimer] Remove listhead from hrtimer struct
The list_head in the hrtimer structure was introduced for easy access
to the first timer with the further extensions of real high resolution
timers in mind, but it turned out in the course of development that
it is not necessary for the standard use case. Remove the list head
and access the first expiry timer by a datafield in the timer base.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2006-01-12 11:25:54 +01:00
Ravikiran G Thirumalai
5fd63b3085 [PATCH] x86_64: Inclusion of ScaleMP vSMP architecture patches - vsmp_align
vSMP specific alignment patch to
1. Define INTERNODE_CACHE_SHIFT for vSMP
2. Use this for alignment of critical structures
3. Use INTERNODE_CACHE_SHIFT for ARCH_MIN_TASKALIGN,
   and let the slab align task_struct allocations to the internode cacheline size
4. Introduce and use ARCH_MIN_MMSTRUCT_ALIGN for mm_struct slab allocations.

Signed-off-by: Ravikiran Thirumalai <kiran@scalemp.com>
Signed-off-by: Shai Fultheim <shai@scalemp.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 19:05:01 -08:00
Andi Kleen
4cef0c6138 [PATCH] x86_64: Make the cpu_*_maps in kernel/sched.c read mostly
They are referred to often so avoid potential false sharing for them.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 19:04:56 -08:00
Randy.Dunlap
c59ede7b78 [PATCH] move capable() to capability.h
- Move capable() from sched.h to capability.h;

- Use <linux/capability.h> where capable() is used
	(in include/, block/, ipc/, kernel/, a few drivers/,
	mm/, security/, & sound/;
	many more drivers/ to go)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:13 -08:00
Ingo Molnar
e16885c5ad [PATCH] uninline capable()
Uninline capable().  Saves 2K of kernel text on a generic .config, and 1K on a
tiny config.  In addition it makes the use of capable more consistent between
CONFIG_SECURITY and !CONFIG_SECURITY

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:13 -08:00
Keshavamurthy Anil S
df019b1d8b [PATCH] kprobes: fix unloading of self probed module
When a kprobes modules is written in such a way that probes are inserted on
itself, then unload of that moudle was not possible due to reference
couning on the same module.

The below patch makes a check and incrementes the module refcount only if
it is not a self probed module.

We need to allow modules to probe themself for kprobes performance
measurements

This patch has been tested on several x86_64, ppc64 and IA64 architectures.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:12 -08:00
David Woodhouse
a4fc7ab1d0 [PATCH] fix/simplify mutex debugging code
Let's switch mutex_debug_check_no_locks_freed() to take (addr, len) as
arguments instead, since all its callers were just calculating the 'to'
address for themselves anyway... (and sometimes doing so badly).

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 08:14:16 -08:00
Ingo Molnar
02706647a4 [PATCH] mutex: trivial whitespace cleanups
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 14:27:59 -08:00
Ingo Molnar
c544bdb199 [PATCH] mark mutex_lock*() as might_sleep()
Mark mutex_lock() and mutex_lock_interruptible() as might_sleep()
functions.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 13:20:47 -08:00
Ingo Molnar
73165b88ff [PATCH] fix i386 mutex fastpath on FRAME_POINTER && !DEBUG_MUTEXES
Call the mutex slowpath more conservatively - e.g.  FRAME_POINTERS can
change the calling convention, in which case a direct branch to the
slowpath becomes illegal.  Bug found by Hugh Dickins.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 13:20:47 -08:00
Ingo Molnar
042c904c3e [PATCH] remove unnecessary asm/mutex.h from kernel/mutex-debug.c
Remove unnecessary (and incorrect) inclusion of asm/mutex.h, pointed out
by David Howells.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 13:20:47 -08:00
Oleg Nesterov
a9c828155a [PATCH] rcu: fix hotplug-cpu ->donelist leak
Pointed out by Srivatsa Vaddagiri <vatsa@in.ibm.com>.

rcu_do_batch() stops after processing maxbatch callbacks
on ->donelist leaving rcu_tasklet in TASKLET_STATE_SCHED
state.

If CPU_DEAD event happens remaining ->donelist entries are
lost, rcu_offline_cpu() kills this tasklet.

With this patch ->donelist migrates along with ->curlist
and ->nxtlist to the current cpu.

Compile tested.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:49:47 -08:00
Oleg Nesterov
69a0b31579 [PATCH] rcu: join rcu_ctrlblk and rcu_state
This patch moves rcu_state into the rcu_ctrlblk. I think there
are no reasons why we should have 2 different variables to control
rcu state. Every user of rcu_state has also "rcu_ctrlblk *rcp" in
the parameter list.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:42:50 -08:00
Adrian Bunk
d974837ae0 [PATCH] kernel/resource.c: __check_region(): remove pointless __deprecated
If a __deprecated is desired it should go to the prototype in the header
(where it currently isn't).

But at this place it's pointless.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:02:02 -08:00
Jesper Juhl
3795e1616f [PATCH] Decrease number of pointer derefs in exit.c
Decrease the number of pointer derefs in kernel/exit.c

Benefits of the patch:
 - Fewer pointer dereferences should make the code slightly faster.
 - Size of generated code is smaller
 - improved readability

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:02:01 -08:00
Keshavamurthy Anil S
a0d50069ed [PATCH] Kprobes: conversion from kcalloc to kzalloc
Signed-of-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:41 -08:00
Ananth N Mavinakayanahalli
0498b63504 [PATCH] kprobes: fix build breakage
The following patch (against 2.6.15-rc5-mm3) fixes a kprobes build break
due to changes introduced in the kprobe locking in 2.6.15-rc5-mm3.  In
addition, the patch reverts back the open-coding of kprobe_mutex.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:40 -08:00
Anil S Keshavamurthy
e597c2984c [PATCH] kprobes: arch_remove_kprobe
Currently arch_remove_kprobes() is only implemented/required for x86_64 and
powerpc.  All other architecture like IA64, i386 and sparc64 implementes a
dummy function which is being called from arch independent kprobes.c file.

This patch removes the dummy functions and replaces it with
#define arch_remove_kprobe(p, s)	do { } while(0)

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:40 -08:00
Keshavamurthy Anil S
f709b12234 [PATCH] kprobes-changed-from-using-spinlock-to-mutex fix
Based on some feedback from Oleg Nesterov, I have made few changes to
previously posted patch.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:40 -08:00
Anil S Keshavamurthy
49a2a1b83b [PATCH] kprobes: changed from using spinlock to mutex
Since Kprobes runtime exception handlers is now lock free as this code path is
now using RCU to walk through the list, there is no need for the
register/unregister{_kprobe} to use spin_{lock/unlock}_isr{save/restore}.  The
serialization during registration/unregistration is now possible using just a
mutex.

In the above process, this patch also fixes a minor memory leak for x86_64 and
powerpc.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:40 -08:00
Anil S Keshavamurthy
2d14e39da8 [PATCH] kprobes: enable funcions only for required arch
Kernel/kprobes.c defines get_insn_slot() and free_insn_slot() which are
currently required _only_ for x86_64 and powerpc (which has no-exec support).

FYI, get{free}_insn_slot() functions manages the memory page which is mapped
as executable, required for instruction emulation.

This patch moves those two functions under __ARCH_WANT_KPROBES_INSN_SLOT and
defines __ARCH_WANT_KPROBES_INSN_SLOT in arch specific kprobes.h file.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:39 -08:00
Matt Helsley
d1c0b8f835 [PATCH] Remove getnstimestamp()
Remove getnstimestamp() in favor of ktime.h's ktime_get_ts()

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:39 -08:00
Matt Helsley
69778e325c [PATCH] Export ktime_get_ts()
This series removes the getnstimestamp() function from kernel/time.c in favor
of kernel/hrtimer.c's ktime_get_ts() function which currently does exactly the
same thing: retrieves a high-resolution (ns) timespec structure and performs
the wall_to_monotonic adjustment.

This patch:

Export ktime_get_ts() to be used as a timestamp function since it uses
getnstimefoday() and does the wall_to_monotonic adjustment.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:39 -08:00
Thomas Gleixner
becf8b5d00 [PATCH] hrtimer: convert posix timers completely
- convert posix-timers.c to use hrtimers

- remove the now obsolete abslist code

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:39 -08:00
Thomas Gleixner
97735f25d2 [PATCH] hrtimer: switch clock_nanosleep to hrtimer nanosleep API
Switch clock_nanosleep to use the new nanosleep functions in hrtimer.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:38 -08:00
Thomas Gleixner
6ba1b91213 [PATCH] hrtimer: switch sys_nanosleep to hrtimer
convert sys_nanosleep() to use hrtimer_nanosleep()

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:38 -08:00
Thomas Gleixner
10c94ec16d [PATCH] hrtimer: create hrtimer nanosleep API
introduce the hrtimer_nanosleep() and hrtimer_nanosleep_real() APIs.  Not yet
used by any code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:38 -08:00
Thomas Gleixner
2ff678b8da [PATCH] hrtimer: switch itimers to hrtimer
switch itimers to a hrtimers-based implementation

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:38 -08:00
Thomas Gleixner
c0a3132963 [PATCH] hrtimer: hrtimer core code
hrtimer subsystem core.  It is initialized at bootup and expired by the timer
interrupt, but is otherwise not utilized by any other subsystem yet.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:37 -08:00
Thomas Gleixner
f8f46da3b4 [PATCH] hrtimer: introduce nsec_t type and conversion functions
- introduce the nsec_t type

- basic nsec conversion routines: timespec_to_ns(), timeval_to_ns(),
  ns_to_timespec(), ns_to_timeval().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:37 -08:00
Thomas Gleixner
718bcceb5a [PATCH] hrtimer: validate timespec of do_sys_settimeofday
Check if the timespec which is provided from user space is normalized.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:37 -08:00
Thomas Gleixner
5f82b2b77e [PATCH] hrtimer: create and use timespec_valid macro
add timespec_valid(ts) [returns false if the timespec is denorm]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:36 -08:00
Thomas Gleixner
a924b04dde [PATCH] hrtimer: make clockid_t arguments const
add const arguments to the posix-timers.h API functions

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:36 -08:00
Andrew Morton
199e705689 [PATCH] hrtimer: export deinlined mktime
This is now uninlined, but some modules use it.

Make it a non-GPL export, since the inlined mktime() was also available that
way.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:35 -08:00
Ingo Molnar
f4818900fa [PATCH] hrtimer: clean up mktime and make arguments const
add 'const' to mktime arguments, and clean it up a bit

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:35 -08:00
Thomas Gleixner
753be62227 [PATCH] hrtimer: deinline mktime and set_normalized_timespec
mktime() and set_normalized_timespec() are large inline functions used in many
places: deinline them.

From: George Anzinger, off-by-1 bugfix

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:35 -08:00
Thomas Gleixner
67924be886 [PATCH] hrtimer: remove duplicate div_long_long_rem implementation
make posix-timers.c use the generic calc64.h facility

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:35 -08:00
Christoph Hellwig
3a0f69d59b [PATCH] common compat_sys_timer_create
The comment in compat.c is wrong, every architecture provides a
get_compat_sigevent() for the IPC compat code already.

This basically moves the x86_64 version to common code and removes all the
others.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:32 -08:00
Vivek Goyal
4ae362be50 [PATCH] kdump: read previous kernel's memory
- Moving the crash_dump.c file to arch dependent part as kmap_atomic_pfn is
  specific to i386 and highmem may not exist in other archs.

- Use ioremap for x86_64 to map the previous kernel memory.

- In copy_oldmem_page(), we now directly copy to the user/kernel buffer and
  avoid the unneccesary copy to a kmalloc'd page.

Signed-off-by: Rachita Kothiyal <rachita@in.ibm.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:28 -08:00
Vivek Goyal
e996e58133 [PATCH] kdump: save registers early (inline functions)
- If system panics then cpu register states are captured through funciton
  crash_get_current_regs().  This is not a inline function hence a stack frame
  is pushed on to the stack and then cpu register state is captured.  Later
  this frame is popped and new frames are pushed (machine_kexec).

- In theory this is not very right as we are capturing register states for a
  frame and that frame is no more valid.  This seems to have created back
  trace problems for ppc64.

- This patch fixes it up.  The very first thing it does after entering
  crash_kexec() is to capture the register states.  Anyway we don't want the
  back trace beyond crash_kexec().  crash_get_current_regs() has been made
  inline

- crash_setup_regs() is the top architecture dependent function which should
  be responsible for capturing the register states as well as to do some
  architecture dependent tricks.  For ex.  fixing up ss and esp for i386.
  crash_setup_regs() has also been made inline to ensure no new call frame is
  pushed onto stack.

Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:27 -08:00
Vivek Goyal
51be5606d9 [PATCH] kdump: export per cpu crash notes pointer through sysfs
- Kexec on panic functionality allocates memory for saving cpu registers in
  case of system crash event.  Address of this allocated memory needs to be
  exported to user space, which is used by kexec-tools.

- Previously, a single /sys/kernel/crash_notes entry was being exported as
  memory allocated was a single continuous array.  Now memory allocation being
  dyanmic and per cpu based, address of per cpu buffer is exported through
  "/sys/devices/system/cpu/cpuX/crash_notes"

Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:26 -08:00
Vivek Goyal
cc57165874 [PATCH] kdump: dynamic per cpu allocation of memory for saving cpu registers
- In case of system crash, current state of cpu registers is saved in memory
  in elf note format.  So far memory for storing elf notes was being allocated
  statically for NR_CPUS.

- This patch introduces dynamic allocation of memory for storing elf notes.
  It uses alloc_percpu() interface.  This should lead to better memory usage.

- Introduced based on Andi Kleen's and Eric W. Biederman's suggestions.

- This patch also moves memory allocation for elf notes from architecture
  dependent portion to architecture independent portion.  Now crash_notes is
  architecture independent.  The whole idea is that size of memory to be
  allocated per cpu (MAX_NOTE_BYTES) can be architecture dependent and
  allocation of this memory can be architecture independent.

Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:26 -08:00
akpm@osdl.org
ed653a6404 [PATCH] Remove set_fs() in stop_machine()
)

From: Brian Gerst <bgerst@didntduck.org>

Call sched_setscheduler() directly instead.

Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:25 -08:00
Linus Torvalds
80c0531514 Merge master.kernel.org:/pub/scm/linux/kernel/git/mingo/mutex-2.6 2006-01-09 17:31:38 -08:00
Oleg Nesterov
dbc1651f0c [PATCH] rcu: don't set ->next_pending in rcu_start_batch()
I think it is better to set ->next_pending in the caller, when
it is needed. This saves one parameter, and this coincides with
cpu_quiet() beahaviour, which sets ->completed = ->cur itself.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-09 17:01:39 -08:00
Jes Sorensen
1b1dcc1b57 [PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem
This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.

Modified-by: Ingo Molnar <mingo@elte.hu>

(finished the conversion)

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2006-01-09 15:59:24 -08:00
Ingo Molnar
de5097c2e7 [PATCH] mutex subsystem, more debugging code
more mutex debugging: check for held locks during memory freeing,
task exit, enable sysrq printouts, etc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
2006-01-09 15:59:21 -08:00
Ingo Molnar
408894ee4d [PATCH] mutex subsystem, debugging code
mutex implementation - add debugging code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
2006-01-09 15:59:20 -08:00
Ingo Molnar
6053ee3b32 [PATCH] mutex subsystem, core
mutex implementation, core files: just the basic subsystem, no users of it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
2006-01-09 15:59:19 -08:00
Linus Torvalds
6150c32589 Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc-merge 2006-01-09 10:03:44 -08:00
Oleg Nesterov
677517771b [PATCH] rcu: uninline __rcu_pending()
__rcu_pending() is rather fat and called twice from rcu_pending().

rcu_pending() has multiple callers, and not that small too.

This patch uninlines both of them.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-09 09:35:44 -08:00
Matt Mackall
64ca9004b8 [PATCH] Make vm86 support optional
This adds an option to remove vm86 support under CONFIG_EMBEDDED.  Saves
about 5k.

This version eliminates most of the #ifdefs of the previous version and
instead uses function stubs in vm86.h.  Also, release_vm86_irqs is moved
from asm-i386/irq.h to a more appropriate home in vm86.h so that the stubs
can live together.

$ size vmlinux-baseline vmlinux-novm86
   text    data     bss     dec     hex filename
2920821  523232  190652 3634705  377611 vmlinux-baseline
2916268  523100  190492 3629860  376324 vmlinux-novm86

Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:11 -08:00
Matt Mackall
e585e47031 [PATCH] tiny: Make *[ug]id16 support optional
Configurable 16-bit UID and friends support

This allows turning off the legacy 16 bit UID interfaces on embedded platforms.

   text    data     bss     dec     hex filename
3330172  529036  190556 4049764  3dcb64 vmlinux-baseline
3328268  529040  190556 4047864  3dc3f8 vmlinux

From: Adrian Bunk <bunk@stusta.de>

    UID16 was accidentially disabled for !EMBEDDED.

Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:11 -08:00
Oleg Nesterov
0f59cc4a35 [PATCH] simplify k_getrusage()
Factor out common code for different RUSAGE_xxx cases.

Don't take ->sighand->siglock in RUSAGE_SELF case, suggested by Ravikiran G
Thirumalai <kiran@scalex86.org>.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:09 -08:00
Nathan Lynch
f756d5e256 [PATCH] fix workqueue oops during cpu offline
Use first_cpu(cpu_possible_map) for the single-thread workqueue case.  We
used to hardcode 0, but that broke on systems where !cpu_possible(0) when
workqueue_struct->cpu_workqueue_struct was changed from a static array to
alloc_percpu.

Commit id bce61dd49d ("Fix hardcoded cpu=0 in
workqueue for per_cpu_ptr() calls") fixed that for Ben's funky sparc64
system, but it regressed my Power5.  Offlining cpu 0 oopses upon the next
call to queue_work for a single-thread workqueue, because now we try to
manipulate per_cpu_ptr(wq->cpu_wq, 1), which is uninitialized.

So we need to establish an unchanging "slot" for single-thread workqueues
which will have a valid percpu allocation.  Since alloc_percpu keys off of
cpu_possible_map, which must not change after initialization, make this
slot == first_cpu(cpu_possible_map).

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:08 -08:00
Ashutosh Naik
eb46996f90 [PATCH] kernel/module.c: remove redundant spinlock in resolve_symbol()
Remove the redundant spinlock in the function resolve_symbol() as we are
not altering the module list, and we already hold the semaphore.

Signed-off-by: Ashutosh Naik <ashutosh.naik@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:04 -08:00
Akinobu Mita
fb1697933a [PATCH] modules: mark TAINT_FORCED_RMMOD correctly
Currently TAINT_FORCED_RMMOD is totally unused.  Because it is marked as
TAINT_FORCED_MODULE instead when user forced a module unload.  This patch
marks it correctly

Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:03 -08:00
Ashutosh Naik
eea8b54dc0 [PATCH] modules: prevent overriding of symbols
Ensure that an exported symbol does not already exist in the kernel or in
some other module's exported symbol table.  This is done by checking the
symbol tables for the exported symbol at the time of loading the module.
Currently this is done after the relocation of the symbol.

Signed-off-by: Ashutosh Naik <ashutosh.naik@gmail.com>
Signed-off-by: Anand Krishnan <anandhkrishnan@yahoo.co.in>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:03 -08:00
Oleg Nesterov
fe7d37d1fb [PATCH] copy_process: error path cleanup
This patch moves 'fork_out:' under 'bad_fork_free:', and removes now
unneeded 'if (retval)' check.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:01 -08:00
Oleg Nesterov
f7dd795e91 [PATCH] setpgid: should not accept ptraced childs
sys_setpgid() allows to change ->pgrp of ptraced childs.

'man setpgid' does not tell anything about that, so I consider
this behaviour is a bug.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Oren Laadan <orenl@cs.columbia.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:01 -08:00
Oren Laadan
e19f247a3d [PATCH] setpgid: should work for sub-threads
setsid() does not work unless the calling process is a
thread_group_leader().

'man setpgid' does not tell anything about that, so I consider this
behaviour is a bug.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:01 -08:00
Oleg Nesterov
ee0acf90d3 [PATCH] setpgid: should work for sub-threads
setpgid(0, pgid) or setpgid(forked_child_pid, pgid) does not work unless
the calling process is a thread_group_leader().

'man setpgid' does not tell anything about that, so I consider this
behaviour is a bug.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Oren Laadan <orenl@cs.columbia.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:01 -08:00
Oren Laadan
9a5d3023e6 [PATCH] fork: fix race in setting child's pgrp and tty
In fork, child should recopy parent's pgrp/tty after it has tasklist_lock.
Otherwise following a setpgid() on the parent, *after* copy_signal(), the
child will own a stale pgrp (which may be reused); (eg.  if copy_mm()
sleeps a long while due to memory pressure).  Similar issue for the tty.

Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:00 -08:00
Eric W. Biederman
5e38291d80 [PATCH] Don't attempt to power off if power off is not implemented
The problem.  It is expected that /sbin/halt -p works exactly like
/sbin/halt, when the kernel does not implement power off functionality.

The kernel can do a lot of work in the reboot notifiers and in
device_shutdown before we even get to machine_power_off.  Some of that
shutdown is not safe if you are leaving the power on, and it definitely
gets in the way of using sysrq or pressing ctrl-alt-del.  Since the
shutdown happens in generic code there is no way to fix this in
architecture specific code :(

Some machines are kernel oopsing today because of this.

The simple solution is to turn LINUX_REBOOT_CMD_POWER_OFF into
LINUX_REBOOT_CMD_HALT if power_off functionality is not implemented.

This has the unfortunate side effect of disabling the power off
functionality on architectures that leave pm_power_off to null and still
implement something in machine_power_off.  And it will break the build on
some architectures that don't have a pm_power_off variable at all.

On both counts I say tough.

For architectures like alpha that don't implement the pm_power_off variable
pm_power_off is declared in linux/pm.h and it is a generic part of our
power management code, and all architectures should implement it.

For architectures like parisc that have a default power off method in
machine_power_off if pm_power_off is not implemented or fails.  It is easy
enough to set the pm_power_off variable.  And nothing bad happens there,
the machines just stop powering off.

The current semantics are impossible without a flag at the top level so we
can avoid the problem code if a power off is not implemented.  pm_power_off
is as good a flag as any with the bonus that it works without modification
on at least x86, x86_64, powerpc, and ppc today.

Andrew can you pick this up and put this in the mm tree.  Kernels that
don't compile or don't power off seem saner than kernels that oops or
panic.  Until we get the arch specific patches for the problem
architectures this probably isn't smart to push into the stable kernel.
Unfortunately I don't have the time at the moment to walk through every
architecture and make them work.  And even if I did I couldn't test it :(

From: Hirokazu Takata <takata@linux-m32r.org>

    Add pm_power_off() for build fix of arch/m32r/kernel/process.c.

From: Miklos Szeredi <miklos@szeredi.hu>

    UML build fix

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Hayato Fujiwara <fujiwara@linux-m32r.org>
Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:00 -08:00
Srivatsa Vaddagiri
d84f520348 [PATCH] Extend RCU torture module to test tickless idle CPU
This patch forces RCU torture threads off various CPUs in the system
allowing them to become idle and go tickless.  Meant to test support for
such tickless idle CPU in RCU.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:59 -08:00
Dave Jones
9841d61d75 [PATCH] Add tainting for proprietary helper modules
Kernels that have had Windows drivers loaded into them are undebuggable.
I've wasted a number of hours chasing bugs filed in Fedora bugzilla only to
find out much later that the user had used such 'helpers', and their
problems were unreproducable without them loaded.

Acked-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:59 -08:00
Eric Dumazet
5160ee6fc8 [PATCH] shrink dentry struct
Some long time ago, dentry struct was carefully tuned so that on 32 bits
UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple
of memory cache lines.

Then RCU was added and dentry struct enlarged by two pointers, with nice
results for SMP, but not so good on UP, because breaking the above tuning
(128 + 8 = 136 bytes)

This patch reverts this unwanted side effect, by using an union (d_u),
where d_rcu and d_child are placed so that these two fields can share their
memory needs.

At the time d_free() is called (and d_rcu is really used), d_child is known
to be empty and not touched by the dentry freeing.

Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so
the previous content of d_child is not needed if said dentry was unhashed
but still accessed by a CPU because of RCU constraints)

As dentry cache easily contains millions of entries, a size reduction is
worth the extra complexity of the ugly C union.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Maneesh Soni <maneesh@in.ibm.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Paul Jackson <pj@sgi.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@epoch.ncsc.mil>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:58 -08:00
Oleg Nesterov
86174cdcb4 [PATCH] remove unneeded sig->curr_target recalculation
This patch removes unneeded sig->curr_target recalculation under 'if
(atomic_dec_and_test(&sig->count))' in __exit_signal().

When sig->count == 0 the signal can't be sent to this task and
next_thread(tsk) == tsk anyway.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:57 -08:00
Oleg Nesterov
485a6435ab [PATCH] little do_group_exit() cleanup
zap_other_threads() sets SIGNAL_GROUP_EXIT at the very start,
do_group_exit() doesn't need to do it.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:55 -08:00
Oleg Nesterov
0811af28ce [PATCH] kill_proc_info_as_uid: don't use hardcoded constants
Use symbolic names instead of hardcoded constants.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Harald Welte <laforge@gnumonks.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:55 -08:00
Ben Collins
676121fcb6 [PATCH] Unchecked alloc_percpu() return in __create_workqueue()
__create_workqueue() not checking return of alloc_percpu()

NULL dereference was possible.

Signed-off-by: Ben Collins <bcollins@ubuntu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:54 -08:00
George Anzinger
71fabd5e48 [PATCH] sigaction should clear all signals on SIG_IGN, not just < 32
While rooting aroung in the signal code trying to understand how to fix the
SIG_IGN ploy (set sig handler to SIG_IGN and flood system with high speed
repeating timers) I came across what, I think, is a problem in sigaction()
in that when processing a SIG_IGN request it flushes signals from 1 to
SIGRTMIN and leaves the rest.  Attempt to fix this.

Signed-off-by: George Anzinger <george@mvista.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:53 -08:00
Guillaume Chazarain
025510cd20 [PATCH] printk return value: fix it
What's the true meaning of the printk return value?  Should it include the
priority prefix length of 3?  and what about the timing information?  In
both cases it was broken:

strace -e write echo 1 > /dev/kmsg
=> write(1, "1\n", 2)                      = 5
strace -e write echo "<1>1" > /dev/kmsg
=> write(1, "<1>1\n", 5)                   = 8

The returned length was "length of input string + 3", I made it "length
of string output to the log buffer".

Note that I couldn't find any printk caller in the kernel interested by its
return value besides kmsg_write.

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Acked-By: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:52 -08:00
Christoph Hellwig
6b9c7ed848 [PATCH] use ptrace_get_task_struct in various places
The ptrace_get_task_struct() helper that I added as part of the ptrace
consolidation is useful in variety of places that currently opencode it.
Switch them to the common helpers.

Add a ptrace_traceme() helper that needs to be explicitly called, and simplify
the ptrace_get_task_struct() interface.  We don't need the request argument
now, and we return the task_struct directly, using ERR_PTR() for error
returns.  It's a bit more code in the callers, but we have two sane routines
that do one thing well now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:51 -08:00
Nick Piggin
095975da26 [PATCH] rcu file: use atomic primitives
Use atomic_inc_not_zero for rcu files instead of special case rcuref.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:48 -08:00
Adrian Bunk
97a41e2612 [PATCH] kernel/: small cleanups
This patch contains the following cleanups:
- make needlessly global functions static
- every file should include the headers containing the prototypes for
  it's global functions

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:48 -08:00
Paul Jackson
03a285f580 [PATCH] cpuset: skip rcu check if task is in root cpuset
For systems that aren't using cpusets, but have them CONFIG_CPUSET enabled in
their kernel (eventually this may be most distribution kernels), this patch
removes even the minimal rcu_read_lock() from the memory page allocation path.

Actually, it removes that rcu call for any task that is in the root cpuset
(top_cpuset), which on systems not actively using cpusets, is all tasks.

We don't need the rcu check for tasks in the top_cpuset, because the
top_cpuset is statically allocated, so at no risk of being freed out from
underneath us.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:45 -08:00
Paul Jackson
7edc59628b [PATCH] cpuset: mark number_of_cpusets read_mostly
Mark cpuset global 'number_of_cpusets' as __read_mostly.

This global is accessed everytime a zone is considered in the zonelist loops
beneath __alloc_pages, looking for a free memory page.  If number_of_cpusets
is just one, then we can short circuit the mems_allowed check.

Since this global is read alot on a hot path, and written rarely, it is an
excellent candidate for __read_mostly.

Thanks to Christoph Lameter for the suggestion.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:45 -08:00
Paul Jackson
6b9c2603ce [PATCH] cpuset: use rcu directly optimization
Optimize the cpuset impact on page allocation, the most performance critical
cpuset hook in the kernel.

On each page allocation, the cpuset hook needs to check for a possible change
in the current tasks cpuset.  It can now handle the common case, of no change,
without taking any spinlock or semaphore, thanks to RCU.

Convert a spinlock on the current task to an rcu_read_lock(), saving
approximately a memory barrier and an atomic op, depending on architecture.

This is done by adding rcu_assign_pointer() and synchronize_rcu() calls to the
write side of the task->cpuset pointer, in cpuset.c:attach_task(), to delay
freeing up a detached cpuset until after any critical sections referencing
that pointer.

Thanks to Andi Kleen, Nick Piggin and Eric Dumazet for ideas.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:45 -08:00
Paul Jackson
c417f0242e [PATCH] cpuset: remove test for null cpuset from alloc code path
Remove a couple of more lines of code from the cpuset hooks in the page
allocation code path.

There was a check for a NULL cpuset pointer in the routine
cpuset_update_task_memory_state() that was only needed during system boot,
after the memory subsystem was initialized, before the cpuset subsystem was
initialized, to catch a NULL task->cpuset pointer.

Add a cpuset_init_early() routine, just before the mem_init() call in
init/main.c, that sets up just enough of the init tasks cpuset structure to
render cpuset_update_task_memory_state() calls harmless.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson
04c19fa6f1 [PATCH] cpuset: migrate all tasks in cpuset at once
Given the mechanism in the previous patch to handle rebinding the per-vma
mempolicies of all tasks in a cpuset that changes its memory placement, it is
now easier to handle the page migration requirements of such tasks at the same
time.

The previous code didn't actually attempt to migrate the pages of the tasks in
a cpuset whose memory placement changed until the next time each such task
tried to allocate memory.  This was undesirable, as users invoking memory page
migration exected to happen when the placement changed, not some unspecified
time later when the task needed more memory.

It is now trivial to handle the page migration at the same time as the per-vma
rebinding is done.

The routine cpuset.c:update_nodemask(), which handles changing a cpusets
memory placement ('mems') now checks for the special case of being asked to
write a placement that is the same as before.  It was harmless enough before
to just recompute everything again, even though nothing had changed.  But page
migration is a heavy weight operation - moving pages about.  So now it is
worth avoiding that if asked to move a cpuset to its current location.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson
4225399a66 [PATCH] cpuset: rebind vma mempolicies fix
Fix more of longstanding bug in cpuset/mempolicy interaction.

NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset
to just the Memory Nodes allowed by that cpuset.  The kernel maintains
internal state for each mempolicy, tracking what nodes are used for the
MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.

When a tasks cpuset memory placement changes, whether because the cpuset
changed, or because the task was attached to a different cpuset, then the
tasks mempolicies have to be rebound to the new cpuset placement, so as to
preserve the cpuset-relative numbering of the nodes in that policy.

An earlier fix handled such mempolicy rebinding for mempolicies attached to a
task.

This fix rebinds mempolicies attached to vma's (address ranges in a tasks
address space.) Due to the need to hold the task->mm->mmap_sem semaphore while
updating vma's, the rebinding of vma mempolicies has to be done when the
cpuset memory placement is changed, at which time mmap_sem can be safely
acquired.  The tasks mempolicy is rebound later, when the task next attempts
to allocate memory and notices that its task->cpuset_mems_generation is
out-of-date with its cpusets mems_generation.

Because walking the tasklist to find all tasks attached to a changing cpuset
requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
affected tasks while doing the tasklist scan.  In general, one cannot acquire
a semaphore (which can sleep) while already holding a spinlock (such as
tasklist_lock).  So a list of mm references has to be built up during the
tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
acquired, and the vma's in that mm rebound.

Once the tasklist lock is dropped, affected tasks may fork new tasks, before
their mm's are rebound.  A kernel global 'cpuset_being_rebound' is set to
point to the cpuset being rebound (there can only be one; cpuset modifications
are done under a global 'manage_sem' semaphore), and the mpol_copy code that
is used to copy a tasks mempolicies during fork catches such forking tasks,
and ensures their children are also rebound.

When a task is moved to a different cpuset, it is easier, as there is only one
task involved.  It's mm->vma's are scanned, using the same
mpol_rebind_policy() as used above.

It may happen that both the mpol_copy hook and the update done via the
tasklist scan update the same mm twice.  This is ok, as the mempolicies of
each vma in an mm keep track of what mems_allowed they are relative to, and
safely no-op a second request to rebind to the same nodes.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson
202f72d5d1 [PATCH] cpuset: number_of_cpusets optimization
Easy little optimization hack to avoid actually having to call
cpuset_zone_allowed() and check mems_allowed, in the main page allocation
routine, __alloc_pages().  This saves several CPU cycles per page allocation
on systems not using cpusets.

A counter is updated each time a cpuset is created or removed, and whenever
there is only one cpuset in the system, it must be the root cpuset, which
contains all CPUs and all Memory Nodes.  In that case, when the counter is
one, all allocations are allowed.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson
74cb21553f [PATCH] cpuset: numa_policy_rebind cleanup
Cleanup, reorganize and make more robust the mempolicy.c code to rebind
mempolicies relative to the containing cpuset after a tasks memory placement
changes.

The real motivator for this cleanup patch is to lay more groundwork for the
upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
after the containing cpuset memory placement changes.

NUMA mempolicies are constrained by the cpuset their task is a member of.
When either (1) a task is moved to a different cpuset, or (2) the 'mems'
mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
be recalculated, relative to their new cpuset placement.

The old code used an unreliable method of determining what was the old
mems_allowed constraining the mempolicy.  It just looked at the tasks
mems_allowed value.  This sort of worked with the present code, that just
rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
referring to the old nodes.  But in an upcoming patch, the vma mempolicies
will be rebound as well.  Then the order in which the various task and vma
mempolicies are updated will no longer be deterministic, and one can no longer
count on the task->mems_allowed holding the old value for as long as needed.
It's not even clear if the current code was guaranteed to work reliably for
task mempolicies.

So I added a mems_allowed field to each mempolicy, stating exactly what
mems_allowed the policy is relative to, and updated synchronously and reliably
anytime that the mempolicy is rebound.

Also removed a useless wrapper routine, numa_policy_rebind(), and had its
caller, cpuset_update_task_memory_state(), call directly to the rewritten
policy_rebind() routine, and made that rebind routine extern instead of
static, and added a "mpol_" prefix to its name, making it
mpol_rebind_policy().

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson
909d75a3b7 [PATCH] cpuset: implement cpuset_mems_allowed
Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
needed, to obtain the mems_allowed vector of a cpuset, and replaced the
workaround in sys_migrate_pages() to call this new method.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:44 -08:00
Paul Jackson
cf2a473c40 [PATCH] cpuset: combine refresh_mems and update_mems
The important code paths through alloc_pages_current() and alloc_page_vma(),
by which most kernel page allocations go, both called
cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
-Both- of these latter two routines did a tasklock, got the tasks cpuset
pointer, and checked for out of date cpuset->mems_generation.

That was a silly duplication of code and waste of CPU cycles on an important
code path.

Consolidated those two routines into a single routine, called
cpuset_update_task_memory_state(), since it updates more than just
mems_allowed.

Changed all callers of either routine to call the new consolidated routine.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:43 -08:00
Paul Jackson
b4b2641843 [PATCH] cpuset: fork hook fix
Fix obscure, never seen in real life, cpuset fork race.  The cpuset_fork()
call in fork.c was setting up the correct task->cpuset pointer after the
tasklist_lock was dropped, which briefly exposed the newly forked process with
an unsafe (copied from parent without locks or usage counter increment) cpuset
pointer.

In theory, that exposed cpuset pointer could have been pointing at a cpuset
that was already freed and removed, and in theory another task that had been
sitting on the tasklist_lock waiting to scan the task list could have raced
down the entire tasklist, found our new child at the far end, and dereferenced
that bogus cpuset pointer.

To fix, setup up the correct cpuset pointer in the new child by calling
cpuset_fork() before the new task is linked into the tasklist, and with that,
add a fork failure case, to dereference that cpuset, if the fork fails along
the way, after cpuset_fork() was called.

Had to remove a BUG_ON() from cpuset_exit(), because it was no longer valid -
the call to cpuset_exit() from a failed fork would not have PF_EXITING set.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:43 -08:00
Paul Jackson
59dac16fb9 [PATCH] cpuset: update_nodemask code reformat
Restructure code layout of the kernel/cpuset.c update_nodemask() routine,
removing embedded returns and nested if's in favor of goto completion labels.
This is being done in anticipation of adding more logic to this routine, which
will favor the goto style structure.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:43 -08:00
Paul Jackson
c5b2aff896 [PATCH] cpuset: minor spacing initializer fixes
Four trivial cpuset fixes: remove extra spaces, remove useless initializers,
mark one __read_mostly.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:43 -08:00
Paul Jackson
3e0d98b9f1 [PATCH] cpuset: memory pressure meter
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.

This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.

This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure.  It's up to the batch
manager or other user code to decide what to do about it and take action.

==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm, the
    system load imposed by a batch scheduler monitoring this metric is
    sharply reduced on large systems, because a scan of the tasklist can be
    avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a single
    read, instead of having to read and accumulate results for a period of
    time.

    Because this meter is per-cpuset rather than per-task or mm, the
    batch scheduler can obtain the key information, memory pressure in a
    cpuset, with a single read, rather than having to query and accumulate
    results over all the (dynamically changing) set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:42 -08:00
Paul Jackson
5966514db6 [PATCH] cpuset: mempolicy one more nodemask conversion
Finish converting mm/mempolicy.c from bitmaps to nodemasks.  The previous
conversion had left one routine using bitmaps, since it involved a
corresponding change to kernel/cpuset.c

Fix that interface by replacing with a simple macro that calls nodes_subset(),
or if !CONFIG_CPUSET, returns (1).

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <christoph@lameter.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:42 -08:00
Paul E. McKenney
2d89c92907 [PATCH] Simpler signal-exit concurrency handling
Some simplification in checking signal delivery against concurrent exit.
Instead of using get_task_struct_rcu(), which increments the task_struct
reference count, check the reference count after acquiring sighand lock.

Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:40 -08:00
Ingo Molnar
e56d090310 [PATCH] RCU signal handling
RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked
instead of tasklist_lock read-locked.  This is a scalability improvement on
SMP and a preemption-latency improvement under PREEMPT_RCU.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: William Irwin <wli@holomorphy.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:40 -08:00
Ravikiran G Thirumalai
22fc6eccbf [PATCH] Change maxaligned_in_smp alignemnt macros to internodealigned_in_smp macros
____cacheline_maxaligned_in_smp is currently used to align critical structures
and avoid false sharing.  It uses per-arch L1_CACHE_SHIFT_MAX and people find
L1_CACHE_SHIFT_MAX useless.

However, we have been using ____cacheline_maxaligned_in_smp to align
structures on the internode cacheline size.  As per Andi's suggestion,
following patch kills ____cacheline_maxaligned_in_smp and introduces
INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches.
Arches needing L3/Internode cacheline alignment can define
INTERNODE_CACHE_SHIFT in the arch asm/cache.h.  Patch replaces
____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp

With this patch, L1_CACHE_SHIFT_MAX can be killed

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:38 -08:00
Paul Jackson
45b07ef31d [PATCH] cpusets: swap migration interface
Add a boolean "memory_migrate" to each cpuset, represented by a file
containing "0" or "1" in each directory below /dev/cpuset.

It defaults to false (file contains "0").  It can be set true by writing
"1" to the file.

If true, then anytime that a task is attached to the cpuset so marked, the
pages of that task will be moved to that cpuset, preserving, to the extent
practical, the cpuset-relative placement of the pages.

Also anytime that a cpuset so marked has its memory placement changed (by
writing to its "mems" file), the tasks in that cpuset will have their pages
moved to the cpusets new nodes, preserving, to the extent practical, the
cpuset-relative placement of the moved pages.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:43 -08:00
Christoph Lameter
39743889aa [PATCH] Swap Migration V5: sys_migrate_pages interface
sys_migrate_pages implementation using swap based page migration

This is the original API proposed by Ray Bryant in his posts during the first
half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.

The intent of sys_migrate is to migrate memory of a process.  A process may
have migrated to another node.  Memory was allocated optimally for the prior
context.  sys_migrate_pages allows to shift the memory to the new node.

sys_migrate_pages is also useful if the processes available memory nodes have
changed through cpuset operations to manually move the processes memory.  Paul
Jackson is working on an automated mechanism that will allow an automatic
migration if the cpuset of a process is changed.  However, a user may decide
to manually control the migration.

This implementation is put into the policy layer since it uses concepts and
functions that are also needed for mbind and friends.  The patch also provides
a do_migrate_pages function that may be useful for cpusets to automatically
move memory.  sys_migrate_pages does not modify policies in contrast to Ray's
implementation.

The current code here is based on the swap based page migration capability and
thus is not able to preserve the physical layout relative to it containing
nodeset (which may be a cpuset).  When direct page migration becomes available
then the implementation needs to be changed to do a isomorphic move of pages
between different nodesets.  The current implementation simply evicts all
pages in source nodeset that are not in the target nodeset.

Patch supports ia64, i386 and x86_64.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:42 -08:00
Christoph Lameter
15316ba81a [PATCH] add schedule_on_each_cpu()
swap migration's isolate_lru_page() currently uses an IPI to notify other
processors that the lru caches need to be drained if the page cannot be
found on the LRU.  The IPI interrupt may interrupt a processor that is just
processing lru requests and cause a race condition.

This patch introduces a new function run_on_each_cpu() that uses the
keventd() to run the LRU draining on each processor.  Processors disable
preemption when dealing the LRU caches (these are per processor) and thus
executing LRU draining from another process is safe.

Thanks to Lee Schermerhorn <lee.schermerhorn@hp.com> for finding this race
condition.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Rohit Seth
8ad4b1fb82 [PATCH] Make high and batch sizes of per_cpu_pagelists configurable
As recently there has been lot of traffic on the right values for batch and
high water marks for per_cpu_pagelists.  This patch makes these two
variables configurable through /proc interface.

A new tunable /proc/sys/vm/percpu_pagelist_fraction is added.  This entry
controls the fraction of pages at most in each zone that are allocated for
each per cpu page list.  The min value for this is 8.  It means that we
don't allow more than 1/8th of pages in each zone to be allocated in any
single per_cpu_pagelist.

The batch value of each per cpu pagelist is also updated as a result.  It
is set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)

Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Andrew Morton
9d0243bca3 [PATCH] drop-pagecache
Add /proc/sys/vm/drop_caches.  When written to, this will cause the kernel to
discard as much pagecache and/or reclaimable slab objects as it can.  THis
operation requires root permissions.

It won't drop dirty data, so the user should run `sync' first.

Caveats:

a) Holds inode_lock for exorbitant amounts of time.

b) Needs to be taught about NUMA nodes: propagate these all the way through
   so the discarding can be controlled on a per-node basis.

This is a debugging feature: useful for getting consistent results between
filesystem benchmarks.  We could possibly put it under a config option, but
it's less than 300 bytes.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:40 -08:00
Michael Ellerman
54c32021eb [PATCH] powerpc: Add arch-dependent copy_oldmem_page
Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-01-09 14:52:35 +11:00
Arnd Bergmann
67207b9664 [PATCH] spufs: The SPU file system, base
This is the current version of the spu file system, used
for driving SPEs on the Cell Broadband Engine.

This release is almost identical to the version for the
2.6.14 kernel posted earlier, which is available as part
of the Cell BE Linux distribution from
http://www.bsc.es/projects/deepcomputing/linuxoncell/.

The first patch provides all the interfaces for running
spu application, but does not have any support for
debugging SPU tasks or for scheduling. Both these
functionalities are added in the subsequent patches.

See Documentation/filesystems/spufs.txt on how to use
spufs.

Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-01-09 14:49:12 +11:00
David S. Miller
0aec63e67c [PATCH] Fix posix-cpu-timers sched_time accumulation
I've spent the past 3 days digging into a glibc testsuite failure in
current CVS, specifically libc/rt/tst-cputimer1.c The thr1 and thr2
timers fire too early in the second pass of this test.  The second
pass is noteworthy because it makes use of intervals, whereas the
first pass does not.

All throughout the posix-cpu-timers.c code, the calculation of the
process sched_time sum is implemented roughly as:

	unsigned long long sum;

	sum = tsk->signal->sched_time;
	t = tsk;
	do {
		sum += t->sched_time;
		t = next_thread(t);
	} while (t != tsk);

In fact this is the exact scheme used by check_process_timers().

In the case of check_process_timers(), current->sched_time has just
been updated (via scheduler_tick(), which is invoked by
update_process_times(), which subsequently invokes
run_posix_cpu_timers()) So there is no special processing necessary
wrt. that.

In other contexts, we have to allot for the fact that tsk->sched_time
might be a bit out of date if we are current.  And the
posix-cpu-timers.c code uses current_sched_time() to deal with that.

Unfortunately it does so in an erroneous and inconsistent manner in
one spot which is what results in the early timer firing.

In cpu_clock_sample_group_locked(), it does this:

		cpu->sched = p->signal->sched_time;
		/* Add in each other live thread.  */
		while ((t = next_thread(t)) != p) {
			cpu->sched += t->sched_time;
		}
		if (p->tgid == current->tgid) {
			/*
			 * We're sampling ourselves, so include the
			 * cycles not yet banked.  We still omit
			 * other threads running on other CPUs,
			 * so the total can always be behind as
			 * much as max(nthreads-1,ncpus) * (NSEC_PER_SEC/HZ).
			 */
			cpu->sched += current_sched_time(current);
		} else {
			cpu->sched += p->sched_time;
		}

The problem is the "p->tgid == current->tgid" test.  If "p" is
not current, and the tgids are the same, we will add the process
t->sched_time twice into cpu->sched and omit "p"'s sched_time
which is very very very wrong.

posix-cpu-timers.c has a helper function, sched_ns(p) which takes care
of this, so my fix is to use that here instead of this special tgid
test.

The fact that current can be one of the sub-threads of "p" points out
that we could make things a little bit more accurate, perhaps by using
sched_ns() on every thread we process in these loops.  It also points
out that we don't use the most accurate value for threads in the group
actively running other cpus (and this is mentioned in the comment).

But that is a future enhancement, and this fix here definitely makes
sense.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 20:23:04 -08:00
Jayachandran C
6fe2e70bbe [PATCH] kernel/module.c: removed dead code
This patch fixes an issue reported by Coverity in kernel/module.c

Error reported: Cannot reach this line of code "else return ptr;"

Patch description:
  This is the error path, so 'err' will be negative, the else case
  is not required, this patch removes it.

Signed-off-by: Jayachandran C. <c.jayachandran@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:59 -08:00
Martin Schwidefsky
347a8dc3b8 [PATCH] s390: cleanup Kconfig
Sanitize some s390 Kconfig options.  We have ARCH_S390, ARCH_S390X,
ARCH_S390_31, 64BIT, S390_SUPPORT and COMPAT.  Replace these 6 options by
S390, 64BIT and COMPAT.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:53 -08:00
Martin Schwidefsky
089545f0c7 [PATCH] s390: cputime_t fixes
There are some more places where the use of cputime_t instead of an integer
type and the associated macros is necessary for the virtual cputime accounting
on s390.  Affected are the s390 specific appldata code and BSD process
accounting.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:49 -08:00
Rafael J. Wysocki
277c6e2ad7 [PATCH] swsusp: save image header first
This makes the swsusp_info structure become the header of the image in the
literal sense (ie.  it is saved to the swap and read before any other image
data with the help of the swsusp's swap map structure, so generally it is
treated in the same way as the rest of the image).

The main thing it does is to make swsusp_header contain the offset of the swap
map used to track the image data pages rather than the offset of swsusp_info.
 Simultaneously, swsusp_info becomes the first image page written to the swap.

The other changes are generally consequences of the above with a few
exceptions (there's some consolidation in the image reading part as a few
functions turn into trivial wrappers around something else).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:43 -08:00
Rafael J. Wysocki
1adf6c8ea9 [PATCH] swsusp: improve handling of swap partitions
This changes the handling of swap partitions by swsusp to avoid locking of the
swap devices that are not used for suspend and, consequently, simplifies the
code.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:43 -08:00
Rafael J. Wysocki
ca0aec0f7a [PATCH] swsusp: make image size limit tunable
Make the suspend image size limit tunable via /sys/power/image_size.

It is necessary for systems on which there is a limited amount of swap
available for suspend.  It can also be useful for optimizing performance of
swsusp on systems with 1 GB of RAM or more.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:42 -08:00
Rafael J. Wysocki
b3a93a255e [PATCH] swsusp: limit image size
Limit the size of the suspend image to approx.  500 MB, which should
improve the overall performance of swsusp on systems with more than 1 GB of
RAM.

It introduces the constant IMAGE_SIZE that can be set to the preferred size
of the image (in MB) and modifies the memory-shrinking part of swsusp to
take this constant into account (500 is the default value of IMAGE_SIZE).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:42 -08:00
Pavel Machek
c050ca7870 [PATCH] swsusp: Drop duplicate prototypes
These two prototypes are already present in sched.h, remove duplicate
version.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:42 -08:00
Rafael J. Wysocki
e5e2fa7857 [PATCH] swsusp: fix enough_free_mem
This patch fixes a problem with the function enough_free_mem() used by
swsusp to verify if there is a sufficient number of memory pages available
to it to create and save the suspend image.

Namely, enough_free_mem() uses nr_free_pages() to obtain the number of free
memory pages, which is incorrect, because this function returns the total
number of free pages, including free highmem pages, and the highmem pages
cannot be used by swsusp for storing the image data.

The patch makes enough_free_mem() avoid counting the free highmem
pages as available to swsusp.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:40 -08:00
Rafael J. Wysocki
72a97e0839 [PATCH] swsusp: improve freeing of memory
This patch makes swsusp free only as much memory as needed to complete the
suspend and not as much as possible.   In the most of cases this should speed
up the suspend and make the system much more responsive after resume,
especially if a GUI (eg.  X Windows) is used.

If needed, the old behavior (ie to free as much memory as possible during
suspend) can be restored by unsetting FAST_FREE in power.h

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:40 -08:00
Rafael J. Wysocki
7088a5c001 [PATCH] swsusp: introduce the swap map structure
This patch introduces the swap map structure that can be used by swsusp for
keeping tracks of data pages written to the swap.   The structure itself is
described in a comment within the patch.

The overall idea is to reduce the amount of metadata written to the swap and
to write and read the image pages sequentially, in a file-alike way.  This
makes the swap-handling part of swsusp fairly independent of its
snapshot-handling part and will hopefully allow us to completely separate
these two parts in the future.

This patch is needed to remove the suspend image size limit imposed by the
limited size of the swsusp_info structure, which is essential for x86-64
systems with more than 512 MB of RAM.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:40 -08:00
Rafael J. Wysocki
f2d97f0296 [PATCH] swsusp: remove encryption
This patch removes the image encryption that is only used by swsusp instead of
zeroing the image after resume in order to prevent someone from reading some
confidential data from it in the future and it does not protect the image from
being read by an unauthorized person before resume.  The functionality it
provides should really belong to the user space and will possibly be
reimplemented after the swap-handling functionality of swsusp is moved to the
user space.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:40 -08:00
Ivan Kokshaysky
eee45269b0 [PATCH] Alpha: convert to generic irq framework (generic part)
Thanks to Christoph for doing most of the work.

This allows automatic SMP IRQ affinity assignment other than default "all
interrupts on all CPUs" which is rather expensive.  This might be useful if
the hardware can be programmed to distribute interrupts among different
CPUs, like Alpha does.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:40 -08:00
David Howells
7ee1dd3fee [PATCH] FRV: Make futex code compilable on nommu [try #2]
Make the futex code compilable and usable on NOMMU by making the attempt to
handle page faults conditional on CONFIG_MMU.  If this is not enabled, then
we can assume that EFAULT returned from futex_atomic_op_inuser() is not
recoverable, and that the address lies outside of valid memory.

handle_mm_fault() is made to BUG if called on NOMMU without attempting to
invoke the actual handler (__handle_mm_fault).

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:33 -08:00
Andrew Morton
a576219aca [PATCH] swsusp: resume_store() retval fix
- This function returns -EINVAL all the time.  Fix.

- Decruftify it a bit too.

- Writing to it doesn't seem to do what it's suppoed to do.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:21 -08:00
Linus Torvalds
db9edfd7e3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6
Trivial manual merge fixup for usb_find_interface clashes.
2006-01-04 18:44:12 -08:00
Linus Torvalds
25c862cc9e Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild 2006-01-04 16:36:52 -08:00
akpm@osdl.org
f743ca5e10 [PATCH] kobject_uevent CONFIG_NET=n fix
lib/lib.a(kobject_uevent.o)(.text+0x25f): In function `kobject_uevent':
: undefined reference to `__alloc_skb'
lib/lib.a(kobject_uevent.o)(.text+0x2a1): In function `kobject_uevent':
: undefined reference to `skb_over_panic'
lib/lib.a(kobject_uevent.o)(.text+0x31d): In function `kobject_uevent':
: undefined reference to `skb_over_panic'
lib/lib.a(kobject_uevent.o)(.text+0x356): In function `kobject_uevent':
: undefined reference to `netlink_broadcast'
lib/lib.a(kobject_uevent.o)(.init.text+0x9): In function `kobject_uevent_init':
: undefined reference to `netlink_kernel_create'
make: *** [.tmp_vmlinux1] Error 1

Netlink is unconditionally enabled if CONFIG_NET, so that's OK.

kobject_uevent.o is compiled even if !CONFIG_HOTPLUG, which is lazy.

Let's compound the sin.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-01-04 16:18:08 -08:00
Kay Sievers
312c004d36 [PATCH] driver core: replace "hotplug" by "uevent"
Leave the overloaded "hotplug" word to susbsystems which are handling
real devices. The driver core does not "plug" anything, it just exports
the state to userspace and generates events.

Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-01-04 16:18:08 -08:00
Kay Sievers
0f76e5acf9 [PATCH] add uevent_helper control in /sys/kernel/
This deprecates the /proc/sys/kernel/hotplug file, as all
this stuff should be in /sys some day, right? :)
In /sys/kernel/ we have now uevent_seqnum and uevent_helper.
The seqnum is no longer used by udev, as the version for this
kernel depends on netlink which events will never get
out-of-order.

Recent udev versions disable the /sbin/hotplug helper with
an init script, cause it leads to OOM on big boxes by running
hundreds of shells in parallel. It should be done now by:
  echo "" > /sys/kernel/uevent_helper

(Note that "-n" does not work, cause neighter proc nor sysfs
support truncate().)

Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-01-04 16:18:07 -08:00
Kay Sievers
0296b22813 [PATCH] remove CONFIG_KOBJECT_UEVENT option
It makes zero sense to have hotplug, but not the netlink
events enabled today. Remove this option and merge the
kobject_uevent.h header into the kobject.h header file.

Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-01-04 16:18:07 -08:00
Adrian Bunk
f4b09ebc8b update the email address of Randy Dunlap
This patch removes all references to the bouncing address
rddunlap@osdl.org and one dead web page from the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
2006-01-03 13:37:51 +01:00
febf7ea4be gitignore: ignore more generated files
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2006-01-03 11:35:26 +01:00
Linus Torvalds
de9e007d91 sysctl: make sure to terminate strings with a NUL
This is a slightly more complete fix for the previous minimal sysctl
string fix.  It always terminates the returned string with a NUL, even
if the full result wouldn't fit in the user-supplied buffer.

The returned length is the full untruncated length, so that you can
tell when truncation has occurred.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-31 17:00:29 -08:00
Yi Yang
82c9df8201 [PATCH] Fix false old value return of sysctl
For the sysctl syscall, if the user wants to get the old value of a
sysctl entry and set a new value for it in the same syscall, the old
value is always overwritten by the new value if the sysctl entry is of
string type and if the user sets its strategy to sysctl_string.  This
issue lies in the strategy being run twice if the strategy is set to
sysctl_string, the general strategy sysctl_string always returns 0 if
success.

Such strategy routines as sysctl_jiffies and sysctl_jiffies_ms return 1
because they do read and write for the sysctl entry.

The strategy routine sysctl_string return 0 although it actually read
and write the sysctl entry.

According to my analysis, if a strategy routine do read and write, it
should return 1, if it just does some necessary check but not read and
write, it should return 0, for example sysctl_intvec.

Signed-off-by: Yi Yang <yang.y.yi@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-30 17:22:08 -08:00
Linus Torvalds
8febdd85ad sysctl: don't overflow the user-supplied buffer with '\0'
If the string was too long to fit in the user-supplied buffer,
the sysctl layer would zero-terminate it by writing past the
end of the buffer. Don't do that.

Noticed by Yi Yang <yang.y.yi@gmail.com>

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-30 17:18:53 -08:00
Andrew Morton
8e31108b9f [PATCH] Fix memory ordering problem in wake_futex()
Fix a memory ordering problem that occurs on IA64. The "store" to q->lock_ptr
in wake_futex() can become visible before wake_up_all() clears the lock in the
futex_q.

Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-24 12:13:27 -08:00
Jason Wessel
9e28393998 [PATCH] kernel/params.c: fix sysfs access with CONFIG_MODULES=n
All the work was done to setup the file and maintain the file handles but
the access functions were zeroed out due to the #ifdef.  Removing the
#ifdef allows full access to all the parameters when CONFIG_MODULES=n.

akpm: put it back again, but use CONFIG_SYSFS instead.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-20 10:31:33 -08:00
Alexey Starikovskiy
729b4d4ce1 [ACPI] fix reboot upon suspend-to-disk
http://bugzilla.kernel.org/show_bug.cgi?id=4320

Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2005-12-15 13:28:14 -05:00
Keshavamurthy Anil S
bf8d5c52c3 [PATCH] kprobes: increment kprobe missed count for multiprobes
When multiple probes are registered at the same address and if due to some
recursion (probe getting triggered within a probe handler), we skip calling
pre_handlers and just increment nmissed field.

The below patch make sure it walks the list for multiple probes case.
Without the below patch we get incorrect results of nmissed count for
multiple probe case.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12 08:57:45 -08:00
Keshavamurthy Anil S
00d7c05ab1 [PATCH] kprobes: no probes on critical path
For Kprobes critical path is the path from debug break exception handler
till the control reaches kprobes exception code.  No probes can be
supported in this path as we will end up in recursion.

This patch prevents this by moving the below function to safe __kprobes
section onto which no probes can be inserted.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12 08:57:45 -08:00
Pierre Ossman
7a4ae749a4 [PATCH] Add try_to_freeze to kauditd
kauditd was causing suspends to fail because it refused to freeze.  Adding
a try_to_freeze() to its sleep loop solves the issue.

Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
Acked-by: Pavel Machek <pavel@suse.cz>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12 08:57:43 -08:00
Keshavamurthy Anil S
adad0f331f [PATCH] kprobes: fix race in aggregate kprobe registration
When registering multiple kprobes at the same address, we leave a small
window where the kprobe hlist will not contain a reference to the
registered kprobe, leading to potentially, a system crash if the breakpoint
is hit on another processor.

Patch below now automically relpace the old kprobe with the new
kprobe from the hash list.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12 08:57:43 -08:00
Matt Helsley
64123fd42c [PATCH] Add getnstimestamp function
There are several functions that might seem appropriate for a timestamp:

get_cycles()
current_kernel_time()
do_gettimeofday()
<read jiffies/jiffies_64>

Each has problems with combinations of SMP-safety, low resolution, and
monotonicity. This patch adds a new function that returns a monotonic SMP-safe
timestamp with nanosecond resolution where available.

Changes:
	Split timestamp into separate patch
	Moved to kernel/time.c
	Renamed to getnstimestamp
	Fixed unintended-pointer-arithmetic bug

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12 08:57:42 -08:00
Srivatsa Vaddagiri
c3f5902325 [PATCH] Fix RCU race in access of nohz_cpu_mask
Accessing nohz_cpu_mask before incrementing rcp->cur is racy.  It can cause
tickless idle CPUs to be included in rsp->cpumask, which will extend
graceperiods unnecessarily.

Fix this race.  It has been tested using extensions to RCU torture module
that forces various CPUs to become idle.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12 08:57:42 -08:00
Srivatsa Vaddagiri
89d46b8778 [PATCH] Fix bug in RCU torture test
While doing some test of RCU torture module, I hit a OOPS in rcu_do_batch,
which was trying to processes callback of a module that was just removed.
This is because we weren't waiting long enough for all callbacks to fire.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12 08:57:42 -08:00
Dipankar Sarma
ab4720ec76 [PATCH] add rcu_barrier() synchronization point
This introduces a new interface - rcu_barrier() which waits until all
the RCUs queued until this call have been completed.

Reiser4 needs this, because we do more than just freeing memory object
in our RCU callback: we also remove it from the list hanging off
super-block.  This means, that before freeing reiser4-specific portion
of super-block (during umount) we have to wait until all pending RCU
callbacks are executed.

The only change of reiser4 made to the original patch, is exporting of
rcu_barrier().

Cc: Hans Reiser <reiser@namesys.com>
Cc: Vladimir V. Saveliev <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12 08:57:42 -08:00
Mao, Bibo
b3e55c727f [PATCH] Kprobes: Reference count the modules when probed on it
When a Kprobes are inserted/removed on a modules, the modules must be ref
counted so as not to allow to unload while probes are registered on that
module.

Without this patch, the probed module is free to unload, and when the
probing module unregister the probe, the kpobes code while trying to
replace the original instruction might crash.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Mao Bibo <bibo.mao@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-12-12 08:57:41 -08:00
David Shaohua Li
1a38416cea [ACPI] SMP S3 resume: evaluate _WAK after INIT
On SMP resume from S3, we reset (INIT) the non-boot
processors to boot them cleanly.  But the BIOS needs
to execute _WAK after INIT in order to properly
initialized these processors upon resume.

http://bugzilla.kernel.org/show_bug.cgi?id=5651

Signed-off-by: David Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2005-11-30 23:15:55 -05:00
Pavel Machek
123d3c13e2 [PATCH] fix swsusp on machines not supporting S4
Fix swsusp on machines not supporting S4.  With recent changes, it is not
possible to trigger it using /sys filesystem.  Swsusp does not really need
any support from low-level code, it is possible to reboot or halt at the
end of suspend.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29 19:47:03 -08:00
David Gibson
5bd0190bf3 [PATCH] Fix crash when ptrace poking hugepage areas
set_page_dirty() will not cope with being handed a page * which is part of
a compound page, but not the master page in that compound page.  This case
can occur via access_process_vm() if you attemp to write to another
process's hugepage memory area using ptrace() (causing an oops or hang).

This patch fixes the bug by only calling set_page_dirty() from
access_process_vm() if the page is not a compound page.  We already use a
similar fix in bio_set_pages_dirty() for the case of direct io to
hugepages.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29 19:47:03 -08:00
Paul Jackson
8c4b8add83 [PATCH] cpuset fork locking fix
Move the cpuset_fork() call below the write_unlock_irq call in
kernel/fork.c copy_process().

Since the cpuset-dual-semaphore-locking-overhaul.patch, the cpuset_fork()
routine acquires task_lock(), so cannot be called while holding the
tasklist_lock for write.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28 14:42:24 -08:00
Ben Collins
bce61dd49d [PATCH] Fix hardcoded cpu=0 in workqueue for per_cpu_ptr() calls
Tracked this down on an Ultra Enterprise 3000.  It's a 6-way machine.  Odd
thing about this machine (and it's good for finding bugs like this) is that
the CPU id's are not 0 based.  For instance, on my machine the CPU's are
6/7/10/11/14/15.

This caused some NULL pointer dereference in kernel/workqueue.c because for
single_threaded workqueue's, it hardcoded the cpu to 0.

I changed the 0's to any_online_cpu(cpu_online_mask), which cpumask.h
claims is "First cpu in mask".  So this fits the same usage.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28 14:42:23 -08:00
Oleg Nesterov
ee500f2749 [PATCH] fix 32bit overflow in timespec_to_sample()
fix 32bit overflow in timespec_to_sample()

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28 14:42:23 -08:00
Andrew Morton
c13cf856cb [PATCH] fork.c: proc_fork_connector() called under write_lock()
Don't do that - it does GFP_KERNEL allocations, for a start.

(Reported by Guillaume Thouvenin <guillaume.thouvenin@bull.net>)

Acked-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28 14:42:23 -08:00
Ashok Raj
a9d9baa1e8 [PATCH] clean up lock_cpu_hotplug() in cpufreq
There are some callers in cpufreq hotplug notify path that the lowest
function calls lock_cpu_hotplug().  The lock is already held during
cpu_up() and cpu_down() calls when the notify calls are broadcast to
registered clients.

Ideally if possible, we could disable_preempt() at the highest caller and
make sure we dont sleep in the path down in cpufreq->driver_target() calls
but the calls are so intertwined and cumbersome to cleanup.

Hence we consistently use lock_cpu_hotplug() and unlock_cpu_hotplug() in
all places.

 - Removed export of cpucontrol semaphore and made it static.
 - removed explicit uses of up/down with lock_cpu_hotplug()
   so we can keep track of the the callers in same thread context and
   just keep refcounts without calling a down() that causes a deadlock.
 - Removed current_in_hotplug() uses
 - Removed PF_HOTPLUG_CPU in sched.h introduced for the current_in_hotplug()
   temporary workaround.

Tested with insmod of cpufreq_stat.ko, and logical online/offline
to make sure we dont have any hang situations.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Cc: Zwane Mwaikambo <zwane@linuxpower.ca>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28 14:42:23 -08:00
Benjamin Herrenschmidt
e9b15b54d3 [PATCH] Fix crash in unregister_console()
If unregister_console() is inadvertently called while no consoles are
registered, it will crash trying to dereference NULL pointer.  It is
necessary to fix that because register_console() provides no indication
that it actually registered the console passed in.  In fact, it may well
decide not to register it based on various things...

(akpm: It'd be better to make register_console() return something and fix the
callers.  All 106 of them...)

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-23 16:08:39 -08:00
Hugh Dickins
cc3327e7df [PATCH] mm: unbloat get_futex_key
The follow_page changes in get_futex_key have left it with two almost
identical blocks, when handling the rare case of a futex in a nonlinear vma.
get_user_pages will itself do that follow_page, and its additional
find_extend_vma is hardly any overhead since the vma is already cached.  Let's
just delete the follow_page block and let get_user_pages do it.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-23 16:08:38 -08:00
Matthew Wilcox
c2b5a251b9 [PATCH] Check the irq number is within bounds
Most of the functions already check. Do the ones that didn't.

Signed-off-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-23 11:11:28 -08:00
Hugh Dickins
0b0db14c53 [PATCH] unpaged: copy_page_range vma
For copy_one_pte's print_bad_pte to show the task correctly (instead of
"???"), dup_mmap must pass down parent vma rather than child vma.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22 09:13:43 -08:00
Paul E. McKenney
996417d2c4 [PATCH] add success/failure indication to RCU torture test
One issue with the RCU torture test is that the current error flagging can
be lost in dmesg.  This patch adds a "SUCCESS"/"FAILURE" string to the line
that flags the end of the test, where it can easily be seen with "dmesg |
tail" at the end of the test.  Also adds tests of architecture-specific
memory barriers -- or, more likely, of the RCU torture test itself.

Cc: <vatsa@in.ibm.com>
Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-18 07:49:45 -08:00
Martin Waitz
ddad86c2d6 [PATCH] DocBook: include printk documentation
Add printk documentation to kernel-api.

Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:21 -08:00
George Anzinger
3f39894d1b [PATCH] timespec: normalize off by one errors
It would appear that the timespec normalize code has an off by one error.
Found in three places.  Thanks to Ben for spotting.

Signed-off-by: George Anzinger<george@mvista.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:17 -08:00
Zach Brown
20dcae3243 [PATCH] aio: remove kioctx from mm_struct
Sync iocbs have a life cycle that don't need a kioctx.  Their retrying, if
any, is done in the context of their owner who has allocated them on the
stack.

The sole user of a sync iocb's ctx reference was aio_complete() checking for
an elevated iocb ref count that could never happen.  No path which grabs an
iocb ref has access to sync iocbs.

If we were to implement sync iocb cancelation it would be done by the owner of
the iocb using its on-stack reference.

Removing this chunk from aio_complete allows us to remove the entire kioctx
instance from mm_struct, reducing its size by a third.  On a i386 testing box
the slab size went from 768 to 504 bytes and from 5 to 8 per page.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:16 -08:00
Kirill Korotaev
4557398f8c [PATCH] stop_machine() vs. synchronous IPI send deadlock
This fixes deadlock of stop_machine() vs.  synchronous IPI send.  The
problem is that stop_machine() disables interrupts before disabling
preemption on other CPUs.  So if another CPU is preempted and then calls
something like flush_tlb_all() it will deadlock with CPU doing
stop_machine() and which can't process IPI due to disabled IRQs.

I changed stop_machine() to do the same things exactly as it does on other
CPUs, i.e.  it should disable preemption first on _all_ CPUs including
itself and only after that disable IRQs.

Signed-off-by: Kirill Korotaev <dev@sw.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Andrey Savochkin" <saw@sawoct.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:16 -08:00
Ingo Molnar
dbdf65b1b7 [PATCH] rcutorture: renice to low priority
Make the box usable for interactive work when running the RCU torture test,
by renicing the RCU torture-test threads to +19 by default.  Kthreads run
at nice -5 by default.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:15 -08:00
Heiko Carstens
b17b0421d7 [PATCH] signal handling: revert sigkill priority fix
This patch reverts commit c33880aadd since
it's not needed anymore. As pointed out by Roland McGrath the real fix
is to deliver all signals before returning to user space.
See http://www.ussg.iu.edu/hypermail/linux/kernel/0509.2/0683.html
A fix for s390 has been merged.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:15 -08:00
Al Viro
10ebffde3d [PATCH] m68k: introduce setup_thread_stack() and end_of_stack()
encapsulates the rest of arch-dependent operations with thread_info access.
Two new helpers - setup_thread_stack() and end_of_stack().  For normal case
the former consists of copying thread_info of parent to new thread_info and
the latter returns pointer immediately past the end of thread_info.

Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:13 -08:00
Al Viro
a1261f5461 [PATCH] m68k: introduce task_thread_info
new helper - task_thread_info(task).  On platforms that have thread_info
allocated separately (i.e.  in default case) it simply returns
task->thread_info.  m68k wants (and for good reasons) to embed its thread_info
into task_struct.  So it will (in later patch) have task_thread_info() of its
own.  For now we just add a macro for generic case and convert existing
instances of its body in core kernel to uses of new macro.  Obviously safe -
all normal architectures get the same preprocessor output they used to get.

Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:13 -08:00
Bob Picco
5563e77078 [PATCH] cpuset: fix return without releasing semaphore
It is wrong to acquire the semaphore and then return from
cpuset_zone_allowed without releasing it.

Signed-off-by: Bob Picco <bob.picco@hp.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:11 -08:00
Christoph Hellwig
005f18dfd0 [PATCH] fix task_struct leak in ptrace
When ptrace_attach fails we need to drop the task_struct reference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:11 -08:00
Jeff Garzik
bca73e4bf8 [PATCH] move pm_register/etc. to CONFIG_PM_LEGACY, pm_legacy.h
Since few people need the support anymore, this moves the legacy
pm_xxx functions to CONFIG_PM_LEGACY, and include/linux/pm_legacy.h.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:10 -08:00
David S. Miller
393b072587 [SPARC64]: Re-export uts_sem for solaris compat module.
Revert: b26b9bc582

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 12:47:50 -08:00
Oleg Nesterov
7ed0175a46 [PATCH] Don't auto-reap traced children
If a task is being traced we never auto-reap it even if it might look
like its parent doesn't care. The tracer obviously _does_ care.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-10 09:57:19 -08:00
Chen, Kenneth W
a47ab9371e [PATCH] optimize activate_task()
recalc_task_prio() is called from activate_task() to calculate dynamic
priority and interactive credit for the activating task.  For real-time
scheduling process, all that dynamic calculation is thrown away at the end
because rt priority is fixed.  Patch to optimize recalc_task_prio() away
for rt processes.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <piggin@cyberone.com.au>
Cc: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 16:07:44 -08:00
Linus Torvalds
28d838cc4d Fix ptrace self-attach rule
Before we did CLONE_THREAD, the way to check whether we were attaching
to ourselves was to just check "current == task", but with CLONE_THREAD
we should check that the thread group ID matches instead.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 11:33:07 -08:00
Nick Piggin
64c7c8f885 [PATCH] sched: resched and cpu_idle rework
Make some changes to the NEED_RESCHED and POLLING_NRFLAG to reduce
confusion, and make their semantics rigid.  Improves efficiency of
resched_task and some cpu_idle routines.

* In resched_task:
- TIF_NEED_RESCHED is only cleared with the task's runqueue lock held,
  and as we hold it during resched_task, then there is no need for an
  atomic test and set there. The only other time this should be set is
  when the task's quantum expires, in the timer interrupt - this is
  protected against because the rq lock is irq-safe.

- If TIF_NEED_RESCHED is set, then we don't need to do anything. It
  won't get unset until the task get's schedule()d off.

- If we are running on the same CPU as the task we resched, then set
  TIF_NEED_RESCHED and no further action is required.

- If we are running on another CPU, and TIF_POLLING_NRFLAG is *not* set
  after TIF_NEED_RESCHED has been set, then we need to send an IPI.

Using these rules, we are able to remove the test and set operation in
resched_task, and make clear the previously vague semantics of
POLLING_NRFLAG.

* In idle routines:
- Enter cpu_idle with preempt disabled. When the need_resched() condition
  becomes true, explicitly call schedule(). This makes things a bit clearer
  (IMO), but haven't updated all architectures yet.

- Many do a test and clear of TIF_NEED_RESCHED for some reason. According
  to the resched_task rules, this isn't needed (and actually breaks the
  assumption that TIF_NEED_RESCHED is only cleared with the runqueue lock
  held). So remove that. Generally one less locked memory op when switching
  to the idle thread.

- Many idle routines clear TIF_POLLING_NRFLAG, and only set it in the inner
  most polling idle loops. The above resched_task semantics allow it to be
  set until before the last time need_resched() is checked before going into
  a halt requiring interrupt wakeup.

  Many idle routines simply never enter such a halt, and so POLLING_NRFLAG
  can be always left set, completely eliminating resched IPIs when rescheduling
  the idle task.

  POLLING_NRFLAG width can be increased, to reduce the chance of resched IPIs.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:33 -08:00
Con Kolivas
ede3d0fba9 [PATCH] sched: consider migration thread with smp nice
The intermittent scheduling of the migration thread at ultra high priority
makes the smp nice handling see that runqueue as being heavily loaded.  The
migration thread itself actually handles the balancing so its influence on
priority balancing should be ignored.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:32 -08:00
Con Kolivas
6dd4a85bb3 [PATCH] sched: correct smp_nice_bias
The priority biasing was off by mutliplying the total load by the total
priority bias and this ruins the ratio of loads between runqueues. This
patch should correct the ratios of loads between runqueues to be proportional
to overall load. -2nd attempt.

From: Dave Kleikamp <shaggy@austin.ibm.com>

  This patch fixes a divide-by-zero error that I hit on a two-way i386
  machine.  rq->nr_running is tested to be non-zero, but may change by the
  time it is used in the division.  Saving the value to a local variable
  ensures that the same value that is checked is used in the division.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:32 -08:00
Con Kolivas
3b0bd9bc6f [PATCH] sched: smp nice bias busy queues on idle rebalance
To intensify the 'nice' support across physical cpus on SMP we can bias the
loads on idle rebalancing. To prevent idle rebalance from trying to pull tasks
from queues that appear heavily loaded we only bias the load if there is more
than one task running.

Add some minor micro-optimisations and have only one return from __source_load
and __target_load functions.

Fix the fact that target_load was not biased by priority when type == 0.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:32 -08:00
Con Kolivas
dad1c65c80 [PATCH] sched: account rt tasks in prio_bias()
Real time tasks' effect on prio_bias should be based on their real time
priority level instead of their static_prio which is based on nice.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:32 -08:00
Con Kolivas
738a2ccbcf [PATCH] sched: change prio bias only if queued
prio_bias should only be adjusted in set_user_nice if p is actually currently
queued.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:32 -08:00
Con Kolivas
b910472dd3 [PATCH] sched: implement nice support across physical cpus on SMP
This patch implements 'nice' support across physical cpus on SMP.

It introduces an extra runqueue variable prio_bias which is the sum of the
(inverted) static priorities of all the tasks on the runqueue.

This is then used to bias busy rebalancing between runqueues to obtain good
distribution of tasks of different nice values.  By biasing the balancing only
during busy rebalancing we can avoid having any significant loss of throughput
by not affecting the carefully tuned idle balancing already in place.  If all
tasks are running at the same nice level this code should also have minimal
effect.  The code is optimised out in the !CONFIG_SMP case.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:32 -08:00
Rafael J. Wysocki
0fbeb5a45d [PATCH] swsusp: rework swsusp_suspend
This patch makes only the functions in swsusp.c call functions in snapshot.c
and not both ways.  It also moves the check for available swap out of
swsusp_suspend() which is necessary for separating the swap-handling functions
in swsusp from the core code.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:55:52 -08:00
Rafael J. Wysocki
ed14b52701 [PATCH] swsusp: simplify pagedir relocation
This patch simplifies the relocation of the page backup list (aka pagedir)
during resume.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:55:52 -08:00
Rafael J. Wysocki
054bd4c188 [PATCH] swsusp: reduce code duplication
The changes made by this patch are necessary for the pagedir relocation
simplification in the next patch.  Additionally, these changes allow us to
drop check_pagedir() and make get_safe_page() be a one-line wrapper around
alloc_image_page() (get_safe_page() goes to snapshot.c, because
alloc_image_page() is static and it does not make sense to export it).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:55:52 -08:00
Pavel Machek
969e9afd48 [PATCH] sleep: Fix oops in enter_state
If ACPI sleep is not configured, but someone still wants to run swsusp,
he'd get oops in enter_state.  This is regression since 2.6.14 and this
fixes it.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:55:50 -08:00
Anton Blanchard
3aef1bde14 [PATCH] quieten softlockup at boot
On a large SMP box we get a lot of softlockup thread XX started lines.

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:55:50 -08:00
Ashok Raj
90d45d17f3 [PATCH] cpu hotplug: fix locking in cpufreq drivers
When calling target drivers to set frequency, we take cpucontrol lock.
When we modified the code to accomodate CPU hotplug, there was an attempt
to take a double lock of cpucontrol leading to a deadlock.  Since the
current thread context is already holding the cpucontrol lock, we dont need
to make another attempt to acquire it.

Now we leave a trace in current->flags indicating current thread already is
under cpucontrol lock held, so we dont attempt to do this another time.

Thanks to Andrew Morton for the beating:-)

From: Brice Goglin <Brice.Goglin@ens-lyon.org>

  Build fix

(akpm: this patch is still unpleasant.  Ashok continues to look for a cleaner
solution, doesn't he?  ;))

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Brice Goglin <Brice.Goglin@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:55:50 -08:00
Al Viro
330d57fb98 [PATCH] Fix sysctl unregistration oops (CVE-2005-2709)
You could open the /proc/sys/net/ipv4/conf/<if>/<whatever> file, then
wait for interface to go away, try to grab as much memory as possible in
hope to hit the (kfreed) ctl_table.  Then fill it with pointers to your
function.  Then do read from file you've opened and if you are lucky,
you'll get it called as ->proc_handler() in kernel mode.

So this is at least an Oops and possibly more.  It does depend on an
interface going away though, so less of a security risk than it would
otherwise be.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-08 17:57:30 -08:00
Linus Torvalds
d27ba47e7e Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6 2005-11-07 18:42:23 -08:00
Al Viro
7b7b1ace2d [PATCH] saner handling of auto_acct_off() and DQUOT_OFF() in umount
The way we currently deal with quota and process accounting that might
keep vfsmount busy at umount time is inherently broken; we try to turn
them off just in case (not quite correctly, at that) and

  a) pray umount doesn't fail (otherwise they'll stay turned off)
  b) pray nobody doesn anything funny just as we turn quota off

Moreover, LSM provides hooks for doing the same sort of broken logics.

The proper way to deal with that is to introduce the second kind of
reference to vfsmount.  Semantics:

 - when the last normal reference is dropped, all special ones are
   converted to normal ones and if there had been any, cleanup is done.
 - normal reference can be cloned into a special one
 - special reference can be converted to normal one; that's a no-op if
   we'd already passed the point of no return (i.e.  mntput() had
   converted special references to normal and started cleanup).

The way it works: e.g. starting process accounting converts the vfsmount
reference pinned by the opened file into special one and turns it back
to normal when it gets shut down; acct_auto_close() is done when no
normal references are left.  That way it does *not* obstruct umount(2)
and it silently gets turned off when the last normal reference to
vfsmount is gone.  Which is exactly what we want...

The same should be done by LSM module that holds some internal
references to vfsmount and wants to shut them down on umount - it should
make them special and security_sb_umount_close() will be called exactly
when the last normal reference to vfsmount is gone.

quota handling is even simpler - we don't use normal file IO anymore, so
there's no need to hold vfsmounts at all.  DQUOT_OFF() is done from
deactivate_super(), where it really belongs.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 18:18:09 -08:00
Hugh Dickins
dedeb0029b [SPARC64] mm: context switch ptlock
sparc64 is unique among architectures in taking the page_table_lock in
its context switch (well, cris does too, but erroneously, and it's not
yet SMP anyway).

This seems to be a private affair between switch_mm and activate_mm,
using page_table_lock as a per-mm lock, without any relation to its uses
elsewhere.  That's fine, but comment it as such; and unlock sooner in
switch_mm, more like in activate_mm (preemption is disabled here).

There is a block of "if (0)"ed code in smp_flush_tlb_pending which would
have liked to rely on the page_table_lock, in switch_mm and elsewhere;
but its comment explains how dup_mmap's flush_tlb_mm defeated it.  And
though that could have been changed at any time over the past few years,
now the chance vanishes as we push the page_table_lock downwards, and
perhaps split it per page table page.  Just delete that block of code.

Which leaves the mysterious spin_unlock_wait(&oldmm->page_table_lock)
in kernel/fork.c copy_mm.  Textual analysis (supported by Nick Piggin)
suggests that the comment was written by DaveM, and that it relates to
the defeated approach in the sparc64 smp_flush_tlb_pending.  Just delete
this block too.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-07 14:09:01 -08:00
Adrian Bunk
b26b9bc582 [PATCH] unexport uts_sem
I didn't find any possible modular usage in the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:07 -08:00
Adrian Bunk
4664957b8e [PATCH] unexport idle_cpu
I didn't find any possible modular usage in the kernel.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:07 -08:00
Adrian Bunk
47bdfb96de [PATCH] unexport console_unblank
I didn't find any possible modular usage of console_unblank in the
kernel.

This patch was already ACK'ed by Alan Cox.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:54:07 -08:00
Randy Dunlap
b8887e6e8c [PATCH] kernel-docs: fix kernel-doc format problems
Convert to proper kernel-doc format.

Some have extra blank lines (not allowed immed.  after the function name)
or need blank lines (after all parameters).  Function summary must be only
one line.

Colon (":") in a function description does weird things (causes kernel-doc
to think that it's a new section head sadly).

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:55 -08:00
Randy Dunlap
1e5d533142 [PATCH] more kernel-doc cleanups, additions
Various core kernel-doc cleanups:
- add missing function parameters in ipc, irq/manage, kernel/sys,
  kernel/sysctl, and mm/slab;
- move description to just above function for kernel_restart()

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:55 -08:00
Ananth N Mavinakayanahalli
d217d5450f [PATCH] Kprobes: preempt_disable/enable() simplification
Reorganize the preempt_disable/enable calls to eliminate the extra preempt
depth.  Changes based on Paul McKenney's review suggestions for the kprobes
RCU changeset.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:46 -08:00
Ananth N Mavinakayanahalli
3516a46042 [PATCH] Kprobes: Use RCU for (un)register synchronization - base changes
Changes to the base kprobes infrastructure to use RCU for synchronization
during kprobe registration and unregistration.  These changes coupled with the
arch kprobe changes (next in series):

a. serialize registration and unregistration of kprobes.
b. enable lockless execution of handlers. Handlers can now run in parallel.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:46 -08:00
Ananth N Mavinakayanahalli
e65845235c [PATCH] Kprobes: Track kprobe on a per_cpu basis - base changes
Changes to the base kprobe infrastructure to track kprobe execution on a
per-cpu basis.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:45 -08:00
Christoph Hellwig
481bed4542 [PATCH] consolidate sys_ptrace()
The sys_ptrace boilerplate code (everything outside the big switch
statement for the arch-specific requests) is shared by most architectures.
This patch moves it to kernel/ptrace.c and leaves the arch-specific code as
arch_ptrace.

Some architectures have a too different ptrace so we have to exclude them.
They continue to keep their implementations.  For sh64 I had to add a
sh64_ptrace wrapper because it does some initialization on the first call.
For um I removed an ifdefed SUBARCH_PTRACE_SPECIAL block, but
SUBARCH_PTRACE_SPECIAL isn't defined anywhere in the tree.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-By: David Howells <dhowells@redhat.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:42 -08:00
Tim Schmielau
8c65b4a604 [PATCH] fix remaining missing includes
Fix more include file problems that surfaced since I submitted the previous
fix-missing-includes.patch.  This should now allow not to include sched.h
from module.h, which is done by a followup patch.

Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:41 -08:00
David Gibson
796f8d9b98 [PATCH] FUTEX_WAKE_OP: enhanced error handling
The code for FUTEX_WAKE_OP calls an arch callback,
futex_atomic_op_inuser().  That callback can return an error code, but
currently the caller assumes any error is EFAULT, and will try various
things to resolve the fault before eventually giving up with EFAULT
(regardless of the original error code).  This is not a theoretical case -
arch callbacks currently return -ENOSYS if the opcode they are given is
bogus.

This patch alters the code to detect non-EFAULT errors and return them
directly to the user.

Of course, whether -ENOSYS is the correct return value for the bogus opcode
case, or whether EINVAL would be more appropriate is another question.

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jamie Lokier <jamie@shareable.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:38 -08:00
Zach Brown
d55b5fdaf4 [PATCH] aio: remove aio_max_nr accounting race
AIO was adding a new context's max requests to the global total before
testing if that resulting total was over the global limit.  This let
innocent tasks get their new limit tested along with a racing guilty task
that was crossing the limit.  This serializes the _nr accounting with a
spinlock It also switches to using unsigned long for the global totals.
Individual contexts are still limited to an unsigned int's worth of
requests by the syscall interface.

The problem and fix were verified with a simple program that spun creating
and destroying a context while holding on to another long lived context.
Before the patch a task creating a tiny context could get a spurious EAGAIN
if it raced with a task creating a very large context that overran the
limit.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:38 -08:00
Matt Helsley
9f46080c41 [PATCH] Process Events Connector
This patch adds a connector that reports fork, exec, id change, and exit
events for all processes to userspace.  It replaces the fork_advisor patch
that ELSA is currently using.  Applications that may find these events
useful include accounting/auditing (e.g.  ELSA), system activity monitoring
(e.g.  top), security, and resource management (e.g.  CKRM).

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:35 -08:00
Pavel Machek
47b90ffe5c [PATCH] swsusp: remove unused variable
Remove unused variable, and make code less evil that way.  Fix whitespace
around for-loop-like macro.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:29 -08:00
Pavel Machek
dc19d507b1 [PATCH] swsusp cleanups
This cleans spaces between * and pointer up, and adds "int" in "unsigned
int".

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:29 -08:00
Heiko Carstens
a4c4af7c8d [PATCH] cpu hoptlug: avoid usage of smp_processor_id() in preemptible code
Replace smp_processor_id() with any_online_cpu(cpu_online_map) in order to
avoid lots of "BUG: using smp_processor_id() in preemptible [00000001]
code:..." messages in case taking a cpu online fails.

All the traces start at the last notifier_call_chain(...) in kernel/cpu.c.
Since we hold the cpu_control semaphore it shouldn't be any problem to access
cpu_online_map.

The reason why cpu_up failed is simply that the cpu that was supposed to be
taken online wasn't even there.  That is because on s390 we never know when a
new cpu comes and therefore cpu_possible_map consists of only ones and doesn't
reflect reality.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:29 -08:00
Andrew Morton
7fd93cf30c [PATCH] posix-timers `unlikely' rejig
!unlikely(expr) hurts my brain.   likely(!expr) is more straightforward.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:24 -08:00
Oleg Nesterov
889dfafe83 [PATCH] improve scheduler fairness a bit
Do not transfer remaining time slice to another cpu on process exit.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-04 10:45:28 -08:00
Paul Mackerras
23fd07750a Merge ../linux-2.6 by hand 2005-10-31 13:37:12 +11:00
Tim Schmielau
4e57b68178 [PATCH] fix missing includes
I recently picked up my older work to remove unnecessary #includes of
sched.h, starting from a patch by Dave Jones to not include sched.h
from module.h. This reduces the number of indirect includes of sched.h
by ~300. Another ~400 pointless direct includes can be removed after
this disentangling (patch to follow later).
However, quite a few indirect includes need to be fixed up for this.

In order to feed the patches through -mm with as little disturbance as
possible, I've split out the fixes I accumulated up to now (complete for
i386 and x86_64, more archs to follow later) and post them before the real
patch.  This way this large part of the patch is kept simple with only
adding #includes, and all hunks are independent of each other.  So if any
hunk rejects or gets in the way of other patches, just drop it.  My scripts
will pick it up again in the next round.

Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:32 -08:00
Paul E. McKenney
b0423a0d9c [PATCH] Remove duplicate code in signal.c
Combine a bit of redundant code between force_sig_info() and
force_sig_specific().

Signed-off-by: paulmck@us.ibm.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:32 -08:00
Oleg Nesterov
ae6866c377 [PATCH] remove unneeded SI_TIMER checks
This patch removes checks for ->si_code == SI_TIMER from send_signal,
specific_send_sig_info, __group_send_sig_info.

I think posix-timers.c used these functions some time ago, now it sends
signals via send_{,group_}sigqueue, so these hooks are unneeded.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:32 -08:00
Oleg Nesterov
621d31219d [PATCH] cleanup the usage of SEND_SIG_xxx constants
This patch simplifies some checks for magic siginfo values.  It should not
change the behaviour in any way.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:31 -08:00
Oleg Nesterov
b67a1b9e4b [PATCH] remove hardcoded SEND_SIG_xxx constants
This patch replaces hardcoded SEND_SIG_xxx constants with
their symbolic names.

No changes in affected .o files.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:31 -08:00
Paul Jackson
4098f9918e [PATCH] sched: hardcode non-smp set_cpus_allowed
Simplify the UP (1 CPU) implementatin of set_cpus_allowed.

The one CPU is hardcoded to be cpu 0 - so just test for that bit, and avoid
having to pick up the cpu_online_map.

Also, unexport cpu_online_map: it was only needed for set_cpus_allowed().

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:28 -08:00
Roland McGrath
708f430dcc [PATCH] posix-cpu-timers: fix overrun reporting
This change corrects an omission in posix_cpu_timer_schedule, so that it
correctly propagates the overrun calculation to where it will get reported
to the user.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:27 -08:00
Paul E. McKenney
a241ec65ae [PATCH] RCU torture-testing kernel module
This patch is a rewrite of the one submitted on October 1st, using modules
(http://marc.theaimsgroup.com/?l=linux-kernel&m=112819093522998&w=2).

This rewrite adds a tristate CONFIG_RCU_TORTURE_TEST, which enables an
intense torture test of the RCU infratructure.  This is needed due to the
continued changes to the RCU infrastructure to accommodate dynamic ticks,
CPU hotplug, realtime, and so on.  Most of the code is in a separate file
that is compiled only if the CONFIG variable is set.  Documentation on how
to run the test and interpret the output is also included.

This code has been tested on i386 and ppc64, and an earlier version of the
code has received extensive testing on a number of architectures as part of
the PREEMPT_RT patchset.

Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:27 -08:00
Thomas Gleixner
ecea8d19c9 [PATCH] jiffies_64 cleanup
Define jiffies_64 in kernel/timer.c rather than having 24 duplicated
defines in each architecture.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:25 -08:00
Roland McGrath
7f2a525559 [PATCH] wait4 PTRACE_ATTACH race fix
Back about a year ago when I last fiddled heavily with the do_wait code, I
was thinking too hard about the wrong thing and I now think I introduced a
bug whose inverse thought I was fixing.

Apparently noone was looking too hard over much shoulder, so as to cite my
bogus reasoning at the time.  In the race condition when PTRACE_ATTACH is
about to steal a child and then the child hits a tracing event (what
my_ptrace_child checks for), the real parent does need to set its flag
noting it has some eligible live children.  Otherwise a spurious ECHILD
error is possible, since the child in question is not yet on the
ptrace_children list.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:24 -08:00
Coywolf Qi Hunt
7407251a0e [PATCH] PF_DEAD cleanup
The PF_DEAD setting doesn't belong to exit_notify(), move it to a proper
place.

Signed-off-by: Coywolf Qi Hunt <qiyong@fc-cn.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:23 -08:00
Jesper Juhl
40dc565122 [PATCH] cleanup for kernel/printk.c
- Removes some trailing whitespace

- Breaks long lines and make other small changes to conform to CodingStyle

- Add explicit printk loglevels in two places.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:23 -08:00
David Howells
20e1129ab8 [PATCH] Keys: Get rid of warning in kmod.c if keys disabled
The attached patch gets rid of a "statement without effect" warning when
CONFIG_KEYS is disabled by making use of the return value of key_get().
The compiler will optimise all of this away when keys are disabled.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:23 -08:00
Andrea Arcangeli
30e0fca6c1 [PATCH] ptrace/coredump/exit_group deadlock
I could seldom reproduce a deadlock with a task not killable in T state
(TASK_STOPPED, not TASK_TRACED) by attaching a NPTL threaded program to
gdb, by segfaulting the task and triggering a core dump while some other
task is executing exit_group and while one task is in ptrace_attached
TASK_STOPPED state (not TASK_TRACED yet).  This originated from a gdb
bugreport (the fact gdb was segfaulting the task wasn't a kernel bug), but
I just incidentally noticed the gdb bug triggered a real kernel bug as
well.

Most threads hangs in exit_mm because the core_dumping is still going, the
core dumping hangs because the stopped task doesn't exit, the stopped task
can't wakeup because it has SIGNAL_GROUP_EXIT set, hence the deadlock.

To me it seems that the problem is that the force_sig_specific(SIGKILL) in
zap_threads is a noop if the task has PF_PTRACED set (like in this case
because gdb is attached).  The __ptrace_unlink does nothing because the
signal->flags is set to SIGNAL_GROUP_EXIT|SIGNAL_STOP_DEQUEUED (verified).

The above info also shows that the stopped task hit a race and got the stop
signal (presumably by the ptrace_attach, only the attach, state is still
TASK_STOPPED and gdb hangs waiting the core before it can set it to
TASK_TRACED) after one of the thread invoked the core dump (it's the core
dump that sets signal->flags to SIGNAL_GROUP_EXIT).

So beside the fact nobody would wakeup the task in __ptrace_unlink (the
state is _not_ TASK_TRACED), there's a secondary problem in the signal
handling code, where a task should ignore the ptrace-sigstops as long as
SIGNAL_GROUP_EXIT is set (or the wakeup in __ptrace_unlink path wouldn't be
enough).

So I attempted to make this patch that seems to fix the problem.  There
were various ways to fix it, perhaps you prefer a different one, I just
opted to the one that looked safer to me.

I also removed the clearing of the stopped bits from the zap_other_threads
(zap_other_threads was safe unlike zap_threads).  I don't like useless
code, this whole NPTL signal/ptrace thing is already unreadable enough and
full of corner cases without confusing useless code into it to make it even
less readable.  And if this code is really needed, then you may want to
explain why it's not being done in the other paths that sets
SIGNAL_GROUP_EXIT at least.

Even after this patch I still wonder who serializes the read of
p->ptrace in zap_threads.

Patch is called ptrace-core_dump-exit_group-deadlock-1.

This was the trace I've got:

test          T ffff81003e8118c0     0 14305      1         14311 14309 (NOTLB)
ffff810058ccdde8 0000000000000082 000001f4000037e1 ffff810000000013
       00000000000000f8 ffff81003e811b00 ffff81003e8118c0 ffff810011362100
       0000000000000012 ffff810017ca4180
Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff80141677>{finish_stop+87}
       <ffffffff8014367f>{get_signal_to_deliver+1359} <ffffffff8010d3ad>{do_signal+157}
       <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80111575>{sys_ptrace+2293}
       <ffffffff80131810>{default_wake_function+0} <ffffffff80196399>{sys_ioctl+73}
       <ffffffff8010dd27>{sysret_signal+28} <ffffffff8010e00f>{ptregscall_common+103}

test          D ffff810011362100     0 14309      1         14305 14312 (NOTLB)
ffff810053c81cf8 0000000000000082 0000000000000286 0000000000000001
       0000000000000195 ffff810011362340 ffff810011362100 ffff81002e338040
       ffff810001e0ca80 0000000000000001
Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff8044677d>{wait_for_completion+173}
       <ffffffff80131810>{default_wake_function+0} <ffffffff80137435>{exit_mm+149}
       <ffffffff801381af>{do_exit+479} <ffffffff80138d0c>{do_group_exit+252}
       <ffffffff801436db>{get_signal_to_deliver+1451} <ffffffff8010d3ad>{do_signal+157}
       <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80140850>{specific_send_sig_info+2

       <ffffffff8014208a>{force_sig_info+186} <ffffffff804479a0>{do_int3+112}
       <ffffffff8010e308>{retint_signal+61}
test          D ffff81002e338040     0 14311      1         14716 14305 (NOTLB)
ffff81005ca8dcf8 0000000000000082 0000000000000286 0000000000000001
       0000000000000120 ffff81002e338280 ffff81002e338040 ffff8100481cb740
       ffff810001e0ca80 0000000000000001
Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff8044677d>{wait_for_completion+173}
       <ffffffff80131810>{default_wake_function+0} <ffffffff80137435>{exit_mm+149}
       <ffffffff801381af>{do_exit+479} <ffffffff80142d0e>{__dequeue_signal+558}
       <ffffffff80138d0c>{do_group_exit+252} <ffffffff801436db>{get_signal_to_deliver+1451}
       <ffffffff8010d3ad>{do_signal+157} <ffffffff8013deee>{ptrace_check_attach+222}
       <ffffffff80140850>{specific_send_sig_info+208} <ffffffff8014208a>{force_sig_info+186}
       <ffffffff804479a0>{do_int3+112} <ffffffff8010e308>{retint_signal+61}

test          D ffff810017ca4180     0 14312      1         14309 13882 (NOTLB)
ffff81005d15fcb8 0000000000000082 ffff81005d15fc58 ffffffff80130816
       0000000000000897 ffff810017ca43c0 ffff810017ca4180 ffff81003e8118c0
       0000000000000082 ffffffff801317ed
Call Trace:<ffffffff80130816>{activate_task+150} <ffffffff801317ed>{try_to_wake_up+893}
       <ffffffff8044677d>{wait_for_completion+173} <ffffffff80131810>{default_wake_function+0}
       <ffffffff8018cdc3>{do_coredump+819} <ffffffff80445f52>{thread_return+82}
       <ffffffff801436d4>{get_signal_to_deliver+1444} <ffffffff8010d3ad>{do_signal+157}
       <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80140850>{specific_send_sig_info+2

       <ffffffff804472e5>{_spin_unlock_irqrestore+5} <ffffffff8014208a>{force_sig_info+186}
       <ffffffff804476ff>{do_general_protection+159} <ffffffff8010e308>{retint_signal+61}

Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:22 -08:00
Paul Jackson
68860ec10b [PATCH] cpusets: automatic numa mempolicy rebinding
This patch automatically updates a tasks NUMA mempolicy when its cpuset
memory placement changes.  It does so within the context of the task,
without any need to support low level external mempolicy manipulation.

If a system is not using cpusets, or if running on a system with just the
root (all-encompassing) cpuset, then this remap is a no-op.  Only when a
task is moved between cpusets, or a cpusets memory placement is changed
does the following apply.  Otherwise, the main routine below,
rebind_policy() is not even called.

When mixing cpusets, scheduler affinity, and NUMA mempolicies, the
essential role of cpusets is to place jobs (several related tasks) on a set
of CPUs and Memory Nodes, the essential role of sched_setaffinity is to
manage a jobs processor placement within its allowed cpuset, and the
essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs
memory placement within its allowed cpuset.

However, CPU affinity and NUMA memory placement are managed within the
kernel using absolute system wide numbering, not cpuset relative numbering.

This is ok until a job is migrated to a different cpuset, or what's the
same, a jobs cpuset is moved to different CPUs and Memory Nodes.

Then the CPU affinity and NUMA memory placement of the tasks in the job
need to be updated, to preserve their cpuset-relative position.  This can
be done for CPU affinity using sched_setaffinity() from user code, as one
task can modify anothers CPU affinity.  This cannot be done from an
external task for NUMA memory placement, as that can only be modified in
the context of the task using it.

However, it easy enough to remap a tasks NUMA mempolicy automatically when
a task is migrated, using the existing cpuset mechanism to trigger a
refresh of a tasks memory placement after its cpuset has changed.  All that
is needed is the old and new nodemask, and notice to the task that it needs
to rebind its mempolicy.  The tasks mems_allowed has the old mask, the
tasks cpuset has the new mask, and the existing
cpuset_update_current_mems_allowed() mechanism provides the notice.  The
bitmap/cpumask/nodemask remap operators provide the cpuset relative
calculations.

This patch leaves open a couple of issues:

 1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies:

    These mempolicies may reference nodes outside of those allowed to
    the current task by its cpuset.  Tasks are migrated as part of jobs,
    which reside on what might be several cpusets in a subtree.  When such
    a job is migrated, all NUMA memory policy references to nodes within
    that cpuset subtree should be translated, and references to any nodes
    outside that subtree should be left untouched.  A future patch will
    provide the cpuset mechanism needed to mark such subtrees.  With that
    patch, we will be able to correctly migrate these other memory policies
    across a job migration.

 2) Updating cpuset, affinity and memory policies in user space:

    This is harder.  Any placement state stored in user space using
    system-wide numbering will be invalidated across a migration.  More
    work will be required to provide user code with a migration-safe means
    to manage its cpuset relative placement, while preserving the current
    API's that pass system wide numbers, not cpuset relative numbers across
    the kernel-user boundary.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:22 -08:00
Paul Jackson
18a19cb304 [PATCH] cpusets: simple rename
Add support for renaming cpusets.  Only allow simple rename of cpuset
directories in place.  Don't allow moving cpusets elsewhere in hierarchy or
renaming the special cpuset files in each cpuset directory.

The usefulness of this simple rename became apparent when developing task
migration facilities.  It allows building a second cpuset hierarchy using
new names and containing new CPUs and Memory Nodes, moving tasks from the
old to the new cpusets, removing the old cpusets, and then renaming the new
cpusets to be just like the old names, so that any knowledge that the tasks
had of their cpuset names will still be valid.

Leaf node cpusets can be migrated to other CPUs or Memory Nodes by just
updating their 'cpus' and 'mems' files, but because no cpuset can contain
CPUs or Nodes not in its parent cpuset, one cannot do this in a cpuset
hierarchy without first expanding all the non-leaf cpusets to contain the
union of both the old and new CPUs and Nodes, which would obfuscate the
one-to-one migration of a task from one cpuset to another required to
correctly migrate the physical page frames currently allocated to that
task.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:21 -08:00
Paul Jackson
053199edf5 [PATCH] cpusets: dual semaphore locking overhaul
Overhaul cpuset locking.  Replace single semaphore with two semaphores.

The suggestion to use two locks was made by Roman Zippel.

Both locks are global.  Code that wants to modify cpusets must first
acquire the exclusive manage_sem, which allows them read-only access to
cpusets, and holds off other would-be modifiers.  Before making actual
changes, the second semaphore, callback_sem must be acquired as well.  Code
that needs only to query cpusets must acquire callback_sem, which is also a
global exclusive lock.

The earlier problems with double tripping are avoided, because it is
allowed for holders of manage_sem to nest the second callback_sem lock, and
only callback_sem is needed by code called from within __alloc_pages(),
where the double tripping had been possible.

This is not quite the same as a normal read/write semaphore, because
obtaining read-only access with intent to change must hold off other such
attempts, while allowing read-only access w/o such intention.  Changing
cpusets involves several related checks and changes, which must be done
while allowing read-only queries (to avoid the double trip), but while
ensuring nothing changes (holding off other would be modifiers.)

This overhaul of cpuset locking also makes careful use of task_lock() to
guard access to the task->cpuset pointer, closing a couple of race
conditions noticed while reading this code (thanks, Roman).  I've never
seen these races fail in any use or test.

See further the comments in the code.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:21 -08:00
Paul Jackson
5aa15b5f27 [PATCH] cpusets: remove depth counted locking hack
Remove a rather hackish depth counter on cpuset locking.  The depth counter
was avoiding a possible double trip on the global cpuset_sem semaphore.  It
worked, but now an improved version of cpuset locking is available, to come
in the next patch, using two global semaphores.

This patch reverses "cpuset semaphore depth check deadlock fix"

The kernel still works, even after this patch, except for some rare and
difficult to reproduce race conditions when agressively creating and
destroying cpusets marked with the notify_on_release option, on very large
systems.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:21 -08:00
Paul Jackson
f35f31d7ed [PATCH] cpuset cleanup
Remove one more useless line from cpuset_common_file_read().

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:21 -08:00
Oleg Nesterov
4eb9af2a8a [PATCH] posix-timers: use schedule_timeout() in common_nsleep()
common_nsleep() reimplements schedule_timeout_interruptible() for unknown
reason.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:20 -08:00
Vadim Lobanov
6dd69f1061 [PATCH] Unify sys_tkill() and sys_tgkill()
The majority of the sys_tkill() and sys_tgkill() function code is
duplicated between the two of them.  This patch pulls the duplication out
into a separate function -- do_tkill() -- and lets sys_tkill() and
sys_tgkill() be simple wrappers around it.  This should make it easier to
maintain in light of future changes.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:19 -08:00
Oleg Nesterov
19a4fcb531 [PATCH] kill sigqueue->lock
This lock is used in sigqueue_free(), but it is always equal to
current->sighand->siglock, so we don't need to keep it in the struct
sigqueue.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:19 -08:00
Andrew Morton
dfc4f94d2f [PATCH] remove timer debug field
Remove timer_list.magic and associated debugging code.

I originally added this when a spinlock was added to timer_list - this meant
that an all-zeroes timer became illegal and init_timer() was required.

That spinlock isn't even there any more, although timer.base must now be
initialised.

I'll keep this debugging code in -mm.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:18 -08:00
Christoph Lameter
89ada67917 [PATCH] Use alloc_percpu to allocate workqueues locally
This patch makes the workqueus use alloc_percpu instead of an array.  The
workqueues are placed on nodes local to each processor.

The workqueue structure can grow to a significant size on a system with
lots of processors if this patch is not applied.  64 bit architectures with
all debugging features enabled and configured for 512 processors will not
be able to boot without this patch.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:18 -08:00
Andrew Morton
a5a0d52c73 [PATCH] ntp whitespace cleanup
Fix bizarre 4-space coding style in the NTP code.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:18 -08:00
john stultz
1bb34a4127 [PATCH] NTP shift_right cleanup
Create a macro shift_right() that avoids the numerous ugly conditionals in the
NTP code that look like:

        if(a < 0)
                b = -(-a >> shift);
        else
                b = a >> shift;

Replacing it with:

        b = shift_right(a, shift);

This should have zero effect on the logic, however it should probably have
a bit of testing just to be sure.

Also replace open-coded min/max with the macros.

Signed-off-by : John Stultz <johnstul@us.ibm.com>

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:18 -08:00
Alan Stern
61e1a9ea4b [PATCH] Add kthread_stop_sem()
Enhance the kthread API by adding kthread_stop_sem, for use in stopping
threads that spend their idle time waiting on a semaphore.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:17 -08:00
Oleg Nesterov
a8db2db1e6 [PATCH] introduce setup_timer() helper
Every user of init_timer() also needs to initialize ->function and ->data
fields.  This patch adds a simple setup_timer() helper for that.

The schedule_timeout() is patched as an example of usage.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:17 -08:00
Shaohua Li
eb9289eb20 [PATCH] introduce .valid callback for pm_ops
Add pm_ops.valid callback, so only the available pm states show in
/sys/power/state.  And this also makes an earlier states error report at
enter_state before we do actual suspend/resume.

Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Acked-by: Pavel Machek<pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:15 -08:00
Rafael J. Wysocki
0245b3e787 [PATCH] swsusp: two simplifications
The following patch simplifies the progress meter in disk.c:free_some_memory()
and makes disk.c:pm_suspend_disk() call device_resume() explicitly in the
suspend path.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:15 -08:00
Rafael J. Wysocki
2e32a43efd [PATCH] swsusp: get rid of unnecessary wrapper function
The following patch merges two functions in a trivial way.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:15 -08:00
Pavel Machek
de491861e1 [PATCH] swsusp: cleanups
Reduce number of ifdefs somehow, and fix whitespace a bit.  No real code
changes.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:14 -08:00
Pavel Machek
96bc7aec20 [PATCH] swsusp: remove unneccessary includes
Cleanup comments and remove unneccessary includes.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:14 -08:00
Rafael J. Wysocki
2c1b4a5ca4 [PATCH] swsusp: rework memory freeing on resume
The following patch makes swsusp use the PG_nosave and PG_nosave_free flags to
mark pages that should be freed in case of an error during resume.

This allows us to simplify the code and to use swsusp_free() in all of the
swsusp's resume error paths, which makes them actually work.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:14 -08:00
Rafael J. Wysocki
a0f496517f [PATCH] swsusp: reduce the use of global variables
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:14 -08:00
Rafael J. Wysocki
25761b6eb7 [PATCH] swsusp: move snapshot functionality to separate file
The following patch moves the functionality of swsusp related to creating and
handling the snapshot of memory to a separate file, snapshot.c

This should enable us to untangle the code in the future and eventually to
implement some parts of swsusp.c in the user space.

The patch does not change the code.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:14 -08:00
Rafael J. Wysocki
351619baf9 [PATCH] swsusp: rework image freeing
The following patch makes swsusp use PG_nosave and PG_nosave_free flags to
mark pages that should be freed after the state of the system has been
restored from the image (or in case of an error during suspend).

This allows us to avoid storing metadata in swap twice and to reduce the
amount of memory needed by swsusp.  Additionally, it allows us to simplify
the code by removing a couple of functions that are no longer necessary.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:14 -08:00
Ashok Raj
c32b6b8e52 [PATCH] create and destroy cpufreq sysfs entries based on cpu notifiers
cpufreq entries in sysfs should only be populated when CPU is online state.
 When we either boot with maxcpus=x and then boot the other cpus by echoing
to sysfs online file, these entries should be created and destroyed when
CPU_DEAD is notified.  Same treatement as cache entries under sysfs.

We place the processor in the lowest frequency, so hw managed P-State
transitions can still work on the other threads to save power.

Primary goal was to just make these directories appear/disapper dynamically.

There is one in this patch i had to do, which i really dont like myself but
probably best if someone handling the cpufreq infrastructure could give
this code right treatment if this is not acceptable.  I guess its probably
good for the first cut.

- Converting lock_cpu_hotplug()/unlock_cpu_hotplug() to disable/enable preempt.
  The locking was smack in the middle of the notification path, when the
  hotplug is already holding the lock. I tried another solution to avoid this
  so avoid taking locks if we know we are from notification path. The solution
  was getting very ugly and i decided this was probably good for this iteration
  until someone who understands cpufreq could do a better job than me.

(akpm: export cpucontrol to GPL modules: drivers/cpufreq/cpufreq_stats.c now
does lock_cpu_hotplug())

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Zwane Mwaikambo <zwane@holomorphy.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:14 -08:00
Brian Gerst
4276d32260 [PATCH] Remove redundant configs.o
Since CONFIG_IKCONFIG_PROC already depends on CONFIG_IKCONFIG, adding
configs.o again is redundant.

Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:10 -08:00
Hugh Dickins
4c21e2f244 [PATCH] mm: split page table lock
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.

This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock.  (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)

In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.

Splitting the lock is not quite for free: another cacheline access.  Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS.  But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.

There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:42 -07:00
Hugh Dickins
deceb6cd17 [PATCH] mm: follow_page with inner ptlock
Final step in pushing down common core's page_table_lock.  follow_page no
longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself;
and so no page_table_lock is taken in get_user_pages itself.

But get_user_pages (and get_futex_key) do then need follow_page to pin the
page for them: take Daniel's suggestion of bitflags to follow_page.

Need one for WRITE, another for TOUCH (it was the accessed flag before:
vanished along with check_user_page_readable, but surely get_numa_maps is
wrong to mark every page it finds as accessed), another for GET.

And another, ANON to dispose of untouched_anonymous_page: it seems silly for
that to descend a second time, let follow_page observe if there was no page
table and return ZERO_PAGE if so.  Fix minor bug in that: check VM_LOCKED -
make_pages_present ought to make readonly anonymous present.

Give get_numa_maps a cond_resched while we're there.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:41 -07:00
Hugh Dickins
c74df32c72 [PATCH] mm: ptd_alloc take ptlock
Second step in pushing down the page_table_lock.  Remove the temporary
bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
to hold page_table_lock, whether it's on init_mm or a user mm; take
page_table_lock internally to check if a racing task already allocated.

Convert their callers from common code.  But avoid coming back to change them
again later: instead of moving the spin_lock(&mm->page_table_lock) down,
switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
encapsulate the mapping+locking and unlocking+unmapping together, and in the
end may use alternatives to the mm page_table_lock itself.

These callers all hold mmap_sem (some exclusively, some not), so at no level
can a page table be whipped away from beneath them; and pte_alloc uses the
"atomic" pmd_present to test whether it needs to allocate.  It appears that on
all arches we can safely descend without page_table_lock.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:40 -07:00
Hugh Dickins
365e9c87a9 [PATCH] mm: update_hiwaters just in time
update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability.  Originally it was called whenever rss or
total_vm got raised.  Then many of those callsites were replaced by a timer
tick call from account_system_time.  Now Frank van Maarseveen reports that to
be found inadequate.  How about this?  Works for Frank.

Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm.  Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths.  Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit.  Handle
mm->hiwater_vm in the same way, though it's much less of an issue.  Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.

And there has been no collector of these hiwater statistics in the tree.  The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).

There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high.  A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.

What locking?  None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact.  But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:39 -07:00
Nick Piggin
b5810039a5 [PATCH] core remove PageReserved
Remove PageReserved() calls from core code by tightening VM_RESERVED
handling in mm/ to cover PageReserved functionality.

PageReserved special casing is removed from get_page and put_page.

All setting and clearing of PageReserved is retained, and it is now flagged
in the page_alloc checks to help ensure we don't introduce any refcount
based freeing of Reserved pages.

MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
deprecated.  We never completely handled it correctly anyway, and is be
reintroduced in future if required (Hugh has a proof of concept).

Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
be trivially removed.

Last real user of PageReserved is swsusp, which uses PageReserved to
determine whether a struct page points to valid memory or not.  This still
needs to be addressed (a generic page_is_ram() should work).

A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
thus mapcounted and count towards shared rss).  These writes to the struct
page could cause excessive cacheline bouncing on big systems.  There are a
number of ways this could be addressed if it is an issue.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Refcount bug fix for filemap_xip.c

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:39 -07:00
Hugh Dickins
7ee7823250 [PATCH] mm: dup_mmap down new mmap_sem
One anomaly remains from when Andrea rationalized the responsibilities of
mmap_sem and page_table_lock: in dup_mmap we add vmas to the child holding its
page_table_lock, but not the mmap_sem which normally guards the vma list and
rbtree.  Which could be an issue for unuse_mm: though since it just walks down
the list (today with page_table_lock, tomorrow not), it's probably okay.  Will
need a memory barrier?  Oh, keep it simple, Nick and I agreed, no harm in
taking child's mmap_sem here.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:38 -07:00
Hugh Dickins
fd3e42fcc8 [PATCH] mm: dup_mmap use oldmm more
Use the parent's oldmm throughout dup_mmap, instead of perversely going back
to current->mm.  (Can you hear the sigh of relief from those mpnts?  Usually I
squash them, but not today.)

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:38 -07:00
Hugh Dickins
4294621f41 [PATCH] mm: rss = file_rss + anon_rss
I was lazy when we added anon_rss, and chose to change as few places as
possible.  So currently each anonymous page has to be counted twice, in rss
and in anon_rss.  Which won't be so good if those are atomic counts in some
configurations.

Change that around: keep file_rss and anon_rss separately, and add them
together (with get_mm_rss macro) when the total is needed - reading two
atomics is much cheaper than updating two atomics.  And update anon_rss
upfront, typically in memory.c, not tucked away in page_add_anon_rmap.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:38 -07:00
Hugh Dickins
404351e67a [PATCH] mm: mm_init set_mm_counters
How is anon_rss initialized?  In dup_mmap, and by mm_alloc's memset; but
that's not so good if an mm_counter_t is a special type.  And how is rss
initialized?  By set_mm_counter, all over the place.  Come on, we just need to
initialize them both at once by set_mm_counter in mm_init (which follows the
memcpy when forking).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:38 -07:00
Hugh Dickins
ab50b8ed81 [PATCH] mm: vm_stat_account unshackled
The original vm_stat_account has fallen into disuse, with only one user, and
only one user of vm_stat_unaccount.  It's easier to keep track if we convert
them all to __vm_stat_account, then free it from its __shackles.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:37 -07:00
YOSHIFUJI Hideaki
4b8f573b5d [PATCH] TIMERS: add missing compensation for HZ == 250
Add missing compensation for (HZ == 250) != (1 << SHIFT_HZ) in
second_overflow().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:35 -07:00
Al Viro
943eae0314 [PATCH] missing exports of do_settimeofday() variants
frv, sh64, ia64 and sparc64 do not have do_settimeofday() exported (the
last two are using variant in kernel/time.c).  Exports added to match
the rest of architectures.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 10:35:07 -07:00
Oleg Nesterov
8d027de54c [PATCH] fix ->signal->live leak in copy_process()
exit_signal() (called from copy_process's error path) should decrement
->signal->live, otherwise forking process will miss 'group_dead' in
do_exit().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 10:28:13 -07:00
Al Viro
9796fdd829 [PATCH] gfp_t: kernel/*
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-28 08:16:49 -07:00
Paul Mackerras
4542437679 Merge in v2.6.14 by hand 2005-10-28 13:38:53 +10:00
Roland McGrath
72ab373a56 [PATCH] Yet more posix-cpu-timer fixes
This just makes sure that a thread's expiry times can't get reset after
it clears them in do_exit.

This is what allowed us to re-introduce the stricter BUG_ON() check in
a362f463a6.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-27 09:08:43 -07:00
Linus Torvalds
a362f463a6 Revert "remove false BUG_ON() from run_posix_cpu_timers()"
This reverts commit 3de463c7d9.

Roland has another patch that allows us to leave the BUG_ON() in place
by just making sure that the condition it tests for really is always
true.

That goes in next.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-27 09:07:33 -07:00
Oleg Nesterov
7a4ed937aa [PATCH] Fix cpu timers expiration time
There's a silly off-by-one error in the code that updates the expiration
of posix CPU timers, causing them to not be properly updated when they
hit exactly on their expiration time (which should be the normal case).

This causes them to then fire immediately again, and only _then_ get
properly updated.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-26 15:21:14 -07:00
Linus Torvalds
70ab81c2ed posix cpu timers: fix timer ordering
Pointed out by Oleg Nesterov, who has been walking over the code
forwards and backwards.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-26 11:23:06 -07:00
Andrew Morton
bb32051532 [PATCH] export cpu_online_map
With CONFIG_SMP=n:

*** Warning: "cpu_online_map" [drivers/firmware/dcdbas.ko] undefined!

due to set_cpus_allowed().

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-26 10:39:43 -07:00
Oleg Nesterov
a69ac4a78d [PATCH] posix-timers: fix posix_cpu_timer_set() vs run_posix_cpu_timers() race
This might be harmless, but looks like a race from code inspection (I
was unable to trigger it).  I must admit, I don't understand why we
can't return TIMER_RETRY after 'spin_unlock(&p->sighand->siglock)'
without doing bump_cpu_timer(), but this is what original code does.

posix_cpu_timer_set:

	read_lock(&tasklist_lock);

	spin_lock(&p->sighand->siglock);
	list_del_init(&timer->it.cpu.entry);
	spin_unlock(&p->sighand->siglock);

We are probaly deleting the timer from run_posix_cpu_timers's 'firing'
local list_head while run_posix_cpu_timers() does list_for_each_safe.

Various bad things can happen, for example we can just delete this timer
so that list_for_each() will not notice it and run_posix_cpu_timers()
will not reset '->firing' flag. In that case,

	....

	if (timer->it.cpu.firing) {
		read_unlock(&tasklist_lock);
		timer->it.cpu.firing = -1;
		return TIMER_RETRY;
	}

sys_timer_settime() goes to 'retry:', calls posix_cpu_timer_set() again,
it returns TIMER_RETRY ...

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-24 08:13:14 -07:00
Oleg Nesterov
ca531a0a5e [PATCH] posix-timers: exit path cleanup
No need to rebalance when task exited

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-24 08:12:35 -07:00
Oleg Nesterov
3de463c7d9 [PATCH] posix-timers: remove false BUG_ON() from run_posix_cpu_timers()
do_exit() clears ->it_##clock##_expires, but nothing prevents
another cpu to attach the timer to exiting process after that.

After exit_notify() does 'write_unlock_irq(&tasklist_lock)' and
before do_exit() calls 'schedule() local timer interrupt can find
tsk->exit_state != 0. If that state was EXIT_DEAD (or another cpu
does sys_wait4) interrupted task has ->signal == NULL.

At this moment exiting task has no pending cpu timers, they were cleaned
up in __exit_signal()->posix_cpu_timers_exit{,_group}(), so we can just
return from irq.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-24 08:12:35 -07:00
Oleg Nesterov
108150ea78 [PATCH] posix-timers: fix cleanup_timers() and run_posix_cpu_timers() races
1. cleanup_timers() sets timer->task = NULL under tasklist + ->sighand locks.
   That means that this code in posix_cpu_timer_del() and posix_cpu_timer_set()

   		lock_timer(timer);
		if (timer->task == NULL)
			return;
		read_lock(tasklist);
		put_task_struct(timer->task)

   is racy. With this patch timer->task modified and accounted only under
   timer->it_lock. Sadly, this means that dead task_struct won't be freed
   until timer deleted or armed.

2. run_posix_cpu_timers() collects expired timers into local list under
   tasklist + ->sighand again. That means that posix_cpu_timer_del()
   should check timer->it.cpu.firing under these locks too.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-24 08:12:35 -07:00
Linus Torvalds
e80eda94d3 Posix timers: limit number of timers firing at once
Bursty timers aren't good for anybody, very much including latency for
other programs when we trigger lots of timers in interrupt context.  So
set a random limit, after which we'll handle the rest on the next timer
tick.

Noted by Oleg Nesterov <oleg@tv-sign.ru>

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-23 10:02:50 -07:00
Paul Mackerras
985990137e Merge changes from linux-2.6 by hand 2005-10-22 16:51:34 +10:00
Roland McGrath
25f407f0b6 [PATCH] Call exit_itimers from do_exit, not __exit_signal
When I originally moved exit_itimers into __exit_signal, that was the only
place where we could reliably know it was the last thread in the group
dying, without races.  Since then we've gotten the signal_struct.live
counter, and do_exit can reliably do group-wide cleanup work.

This patch moves the call to do_exit, where it's made without locks.  This
avoids the deadlock issues that the old __exit_signal code's comment talks
about, and the one that Oleg found recently with process CPU timers.

[ This replaces e03d13e985, which is why
  it was just reverted. ]

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-21 15:38:08 -07:00
Linus Torvalds
9465bee863 Revert "Fix cpu timers exit deadlock and races"
Revert commit e03d13e985, to be replaced
by a much nicer fix from Roland.
2005-10-21 15:36:00 -07:00
Alan Stern
d1209d049b [PATCH] Threads shouldn't inherit PF_NOFREEZE
The PF_NOFREEZE process flag should not be inherited when a thread is
forked.  This patch (as585) removes the flag from the child.

This problem is starting to show up more and more as drivers turn to the
kthread API instead of using kernel_thread().  As a result, their kernel
threads are now children of the kthread worker instead of modprobe, and
they inherit the PF_NOFREEZE flag.  This can cause problems during system
suspend; the kernel threads are not getting frozen as they ought to be.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-19 23:04:31 -07:00
Roland McGrath
e03d13e985 [PATCH] Fix cpu timers exit deadlock and races
Oleg Nesterov reported an SMP deadlock.  If there is a running timer
tracking a different process's CPU time clock when the process owning
the timer exits, we deadlock on tasklist_lock in posix_cpu_timer_del via
exit_itimers.

That code was using tasklist_lock to check for a race with __exit_signal
being called on the timer-target task and clearing its ->signal.
However, there is actually no such race.  __exit_signal will have called
posix_cpu_timers_exit and posix_cpu_timers_exit_group before it does
that.  Those will clear those k_itimer's association with the dying
task, so posix_cpu_timer_del will return early and never reach the code
in question.

In addition, posix_cpu_timer_del called from exit_itimers during execve
or directly from timer_delete in the process owning the timer can race
with an exiting timer-target task to cause a double put on timer-target
task struct.  Make sure we always access cpu_timers lists with sighand
lock held.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-19 23:02:01 -07:00
Eric Dumazet
5ee832dbc6 [PATCH] rcu: keep rcu callback event counter
This makes call_rcu() keep track of how many events there are on the RCU
list, and cause a reschedule event when the list gets too long.

This helps keep RCU event lists down.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-17 15:27:58 -07:00
Oleg Nesterov
47d6b08334 [PATCH] posix-timers: fix task accounting
Make sure we release the task struct properly when releasing pending
timers.

release_task() does write_lock_irq(&tasklist_lock), so it can't race
with run_posix_cpu_timers() on any cpu.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-17 15:00:00 -07:00
Linus Torvalds
2cc78eb52b Increase default RCU batching sharply
Dipankar made RCU limit the batch size to improve latency, but that
approach is unworkable: it can cause the RCU queues to grow without
bounds, since the batch limiter ended up limiting the callbacks.

So make the limit much higher, and start planning on instead limiting
the batch size by doing RCU callbacks more often if the queue looks like
it might be growing too long.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-17 09:10:15 -07:00
Takashi Iwai
c6ecf7ed31 [PATCH] Add missing export of getnstimeofday()
Adds the missing EXPORT_SYMBOL_GPL for getnstimeofday() when
CONFIG_TIME_INTERPOLATION isn't set.  Needed by drivers/char/mmtimer.c

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-14 17:10:12 -07:00
Paul Mackerras
b6ec995a21 Merge from Linus' tree 2005-10-12 14:43:32 +10:00
Harald Welte
46113830a1 [PATCH] Fix signal sending in usbdevio on async URB completion
If a process issues an URB from userspace and (starts to) terminate
before the URB comes back, we run into the issue described above.  This
is because the urb saves a pointer to "current" when it is posted to the
device, but there's no guarantee that this pointer is still valid
afterwards.

In fact, there are three separate issues:

1) the pointer to "current" can become invalid, since the task could be
   completely gone when the URB completion comes back from the device.

2) Even if the saved task pointer is still pointing to a valid task_struct,
   task_struct->sighand could have gone meanwhile.

3) Even if the process is perfectly fine, permissions may have changed,
   and we can no longer send it a signal.

So what we do instead, is to save the PID and uid's of the process, and
introduce a new kill_proc_info_as_uid() function.

Signed-off-by: Harald Welte <laforge@gnumonks.org>
[ Fixed up types and added symbol exports ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-10 16:16:33 -07:00
Rafael J. Wysocki
3dd083255d [PATCH] x86_64: Set up safe page tables during resume
The following patch makes swsusp avoid the possible temporary corruption
of page translation tables during resume on x86-64.  This is achieved by
creating a copy of the relevant page tables that will not be modified by
swsusp and can be safely used by it on resume.

The problem is that during resume on x86-64 swsusp may temporarily
corrupt the page tables used for the direct mapping of RAM.  If that
happens, a page fault occurs and cannot be handled properly, which leads
to the solid hang of the affected system.  This leads to the loss of the
system's state from before suspend and may result in the loss of data or
the corruption of filesystems, so it is a serious issue.  Also, it
appears to happen quite often (for me, as often as 50% of the time).

The problem is related to the fact that (at least) one of the PMD
entries used in the direct memory mapping (starting at PAGE_OFFSET)
points to a page table the physical address of which is much greater
than the physical address of the PMD entry itself.  Moreover,
unfortunately, the physical address of the page table before suspend
(i.e.  the one stored in the suspend image) happens to be different to
the physical address of the corresponding page table used during resume
(i.e.  the one that is valid right before swsusp_arch_resume() in
arch/x86_64/kernel/suspend_asm.S is executed).  Thus while the image is
restored, the "offending" PMD entry gets overwritten, so it does not
point to the right physical address any more (i.e.  there's no page
table at the address pointed to by it, because it points to the address
the page table has been at during suspend).  Consequently, if the PMD
entry is used later on, and it _is_ used in the process of copying the
image pages, a page fault occurs, but it cannot be handled in the normal
way and the system hangs.

In principle we can call create_resume_mapping() from
swsusp_arch_resume() (ie.  from suspend_asm.S), but then the memory
allocations in create_resume_mapping(), resume_pud_mapping(), and
resume_pmd_mapping() must be made carefully so that we use _only_
NosaveFree pages in them (the other pages are overwritten by the loop in
swsusp_arch_resume()).  Additionally, we are in atomic context at that
time, so we cannot use GFP_KERNEL.  Moreover, if one of the allocations
fails, we should free all of the allocated pages, so we need to trace
them somehow.

All of this is done in the appended patch, except that the functions
populating the page tables are located in arch/x86_64/kernel/suspend.c
rather than in init.c.  It may be done in a more elegan way in the
future, with the help of some swsusp patches that are in the works now.

[AK: move some externs into headers, renamed a function]

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-10 08:36:46 -07:00
Al Viro
dd0fc66fb3 [PATCH] gfp flags annotations - part 1
- added typedef unsigned int __nocast gfp_t;

 - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
   the same warnings as far as sparse is concerned, doesn't change
   generated code (from gcc point of view we replaced unsigned int with
   typedef) and documents what's going on far better.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-08 15:00:57 -07:00
Oleg Nesterov
788e05a67c [PATCH] fix do_coredump() vs SIGSTOP race
Let's suppose we have 2 threads in thread group:
	A - does coredump
	B - has pending SIGSTOP

thread A						thread B

do_coredump:						get_signal_to_deliver:

  lock(->sighand)
  ->signal->flags = SIGNAL_GROUP_EXIT
  unlock(->sighand)

							lock(->sighand)
							signr = dequeue_signal()
								->signal->flags |= SIGNAL_STOP_DEQUEUED
								return SIGSTOP;

							do_signal_stop:
							    unlock(->sighand)

  coredump_wait:

      zap_threads:
          lock(tasklist_lock)
          send SIGKILL to B
              // signal_wake_up() does nothing
          unlock(tasklist_lock)

							    lock(tasklist_lock)
							    lock(->sighand)
							    re-check sig->flags & SIGNAL_STOP_DEQUEUED, yes
							    set_current_state(TASK_STOPPED);
							    finish_stop:
							        schedule();
							            // ->state == TASK_STOPPED

      wait_for_completion(&startup_done)
         // waits for complete() from B,
         // ->state == TASK_UNINTERRUPTIBLE

We can't wake up 'B' in any way:

	SIGCONT will be ignored because handle_stop_signal() sees
	->signal->flags & SIGNAL_GROUP_EXIT.

	sys_kill(SIGKILL)->__group_complete_signal() will choose
	uninterruptible 'A', so it can't help.

	sys_tkill(B, SIGKILL) will be ignored by specific_send_sig_info()
	because B already has pending SIGKILL.

This scenario is not possbile if 'A' does do_group_exit(), because
it sets sig->flags = SIGNAL_GROUP_EXIT and delivers SIGKILL to
subthreads atomically, holding both tasklist_lock and sighand->lock.
That means that do_signal_stop() will notice !SIGNAL_STOP_DEQUEUED
after re-locking ->sighand. And it is not possible to any other
thread to re-add SIGNAL_STOP_DEQUEUED later, because dequeue_signal()
can only return SIGKILL.

I think it is better to change do_coredump() to do sigaddset(SIGKILL)
and signal_wake_up() under sighand->lock, but this patch is much
simpler.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-08 14:53:31 -07:00
Linus Torvalds
14bf01bb05 Fix inequality comparison against "task->state"
We should always use bitmask ops, rather than depend on some ordering of
the different states.  With the TASK_NONINTERACTIVE flag, the inequality
doesn't really work.

Oleg Nesterov argues (likely correctly) that this test is unnecessary in
the first place.  However, the minimal fix for now is to at least make
it work in the presense of TASK_NONINTERACTIVE.  Waiting for consensus
from Roland & co on potential bigger cleanups.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-01 11:04:18 -07:00
Al Viro
eacaa1f5aa [PATCH] cpuset crapectomy
Switched cpuset_common_file_read() to simple_read_from_buffer(), killed
a bunch of useless (and not quite correct - e.g.  min(size_t,ssize_t))
code.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-30 08:42:24 -07:00
Roland McGrath
5acbc5cb50 [PATCH] Fix task state testing properly in do_signal_stop()
Any tests using < TASK_STOPPED or the like are left over from the time
when the TASK_ZOMBIE and TASK_DEAD bits were in the same word, and it
served to check for "stopped or dead".  I think this one in
do_signal_stop is the only such case.  It has been buggy ever since
exit_state was separated, and isn't testing the exit_state value.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-29 15:20:47 -07:00
Paul Mackerras
ab11d1ea28 Merge by hand from Linus' tree.
Signed-off-by: Paul Mackerras <paulus@samba.org>
2005-09-29 13:13:36 +10:00
Paul Jackson
5134fc15b6 [PATCH] cpuset read past eof memory leak fix
Don't leak a page of memory if user reads a cpuset file past eof.

Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-28 07:58:51 -07:00
Rafael J. Wysocki
0f7347c20c [PATCH] swsusp: avoid problems if there are too many pages to save
The following patch makes swsusp avoid problems during resume if there are
too many pages to save on suspend.  It adds a constant that allows us to
verify if we are going to save too many pages and implements the check
(this is done as early as we can tell that the check will trigger, which is
in swsusp_alloc()).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-28 07:46:41 -07:00
Rusty Russell
f36462f078 [PATCH] Ignore trailing whitespace on kernel parameters correctly
Dave Jones says:

... if the modprobe.conf has trailing whitespace, modules fail to load
with the following helpful message..

	snd_intel8x0: Unknown parameter `'

Previous version truncated last argument.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-28 07:46:41 -07:00
Rafael J. Wysocki
f2d613799a [PATCH] swsusp: prevent possible memory leak
Prevent swsusp from leaking some memory in case of an error in
read_pagedir().  It also prevents the BUG_ON() from triggering if there's
an error while reading swap.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-28 07:46:40 -07:00
Rafael J. Wysocki
254b54771c [PATCH] swsusp: remove wrong code from data_free
The following patch removes some wrong code from the data_free() function
in swsusp.

This function could only be called if there's an error while writing the
suspend image to swap, so it is not triggered easily.  However, if
triggered, it would probably corrupt some memory.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-28 07:46:40 -07:00
Paul Mackerras
beeca08738 Don't call a NULL ack function in the generic IRQ code.
Some IRQ controllers don't need an ack function (e.g. OpenPIC on
PPC platforms) and for them we'd rather not have the overhead
of doing an indirect call to a function that does nothing.

Signed-off-by: Paul Mackerras <paulus@samba.org>
2005-09-28 20:29:44 +10:00
Linus Torvalds
188a1eafa0 Make sure SIGKILL gets proper respect
Bhavesh P. Davda <bhavesh@avaya.com> noticed that SIGKILL wouldn't
properly kill a process under just the right cicumstances: a stopped
task that already had another signal queued would get the SIGKILL
queued onto the shared queue, and there it would remain until SIGCONT.

This simplifies the signal acceptance logic, and fixes the bug in the
process.

Losely based on an earlier patch by Bhavesh.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-23 13:22:21 -07:00
Pavel Machek
8686bcd0a5 [PATCH] swsusp: fix comments
Fix comments in swsusp.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22 22:17:36 -07:00
Rafael J. Wysocki
57487f4376 [PATCH] swsusp: do not trigger BUG_ON() if there is not enough memory
The following patch makes swsusp avoid triggering the BUG_ON() in
swsusp_suspend() if there is not enough memory for suspend.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22 22:17:35 -07:00
Randy Dunlap
720b9429e8 [PATCH] SOFTWARE_SUSPEND needs HOTPLUG_CPU on SMP
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22 22:17:34 -07:00
Eric W. Biederman
88d10bbaae [PATCH] suspend: cleanup calling of power off methods.
In the lead up to 2.6.13 I fixed a large number of reboot problems by
making the calling conventions consistent.  Despite checking and double
checking my work it appears I missed an obvious one.

The S4 suspend code for PM_DISK_PLATFORM was also calling device_shutdown
without setting system_state, and was not calling the appropriate
reboot_notifier.

This patch fixes the bug by replacing the call of device_suspend with
kernel_poweroff_prepare.

Various forms of this failure have been fixed and tracked for a while.

Thanks for tracking this down go to: Alexey Starikovskiy, Meelis Roos
<mroos@linux.ee>, Nigel Cunningham <ncunningham@cyclades.com>, Pierre
Ossman <drzeus-list@drzeus.cx>

History of this bug is at:
http://bugme.osdl.org/show_bug.cgi?id=4320

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22 22:17:33 -07:00
Eric W. Biederman
e4c94330e3 [PATCH] reboot: comment and factor the main reboot functions
In the lead up to 2.6.13 I fixed a large number of reboot problems by
making the calling conventions consistent.  Despite checking and double
checking my work it appears I missed an obvious one.

This first patch simply refactors the reboot routines so all of the
preparation for various kinds of reboots are in their own functions.
Making it very hard to get the various kinds of reboot out of sync.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-22 22:17:33 -07:00
Andrew Morton
31f6d9d628 [PATCH] Add printk_clock()
ia64's sched_clock() accesses per-cpu data which isn't set up at boot time.
Hence ia64 cannot use printk timestamping, because printk() will crash in
sched_clock().

So make printk() use printk_clock(), defaulting to sched_clock(), overrideable
by the architecture via attribute(weak).

Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-21 10:11:54 -07:00
Dipankar Sarma
4fb3a53860 [PATCH] files: fix preemption issues
With the new fdtable locking rules, you have to protect fdtable with either
->file_lock or rcu_read_lock/unlock().  There are some places where we
aren't doing either.  This patch fixes those places.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-17 11:50:02 -07:00
Michael Kerrisk
2030c0fd3d [PATCH] PR_GET_DUMPABLE returns incorrect info
2.6.13 incorporated Alan Cox's patch for /proc/sys/fs/suid_dumpable (one
version of this patch can be found here
http://marc.theaimsgroup.com/?l=linux-kernel&m=109647550421014&w=2 ).

This patch also made corresponding changes in kernel/sys.c to change the
prctl() PR_SET_DUMPABLE operation so that the permitted range of 'arg2' was
modified from 0..1 to 0..2.

However, a corresponding change was not made for PR_GET_DUMPABLE: if the
dumpable flag is non-zero, then PR_GET_DUMPABLE always returns 1, so that
the caller can't determine the true setting of this flag.

Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-17 11:50:01 -07:00
Srivatsa Vaddagiri
26ff6ad978 [PATCH] CPU hotplug breaks wake_up_new_task
Fix a problem wherein a new-born task is added to a dead CPU.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-17 11:49:59 -07:00
Ingo Molnar
da04c03503 [PATCH] Fix spinlock owner debugging
fix up the runqueue lock owner only if we truly did a context-switch
with the runqueue lock held. Impacts ia64, mips, sparc64 and arm.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-13 09:59:04 -07:00
Linus Torvalds
5d54e69c68 Merge master.kernel.org:/pub/scm/linux/kernel/git/dwmw2/audit-2.6 2005-09-13 09:47:30 -07:00
Randy Dunlap
9f1583339a [PATCH] use add_taint() for setting tainted bit flags
Use the add_taint() interface for setting tainted bit flags instead of
doing it manually.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-13 08:22:29 -07:00
Andrew Morton
8a1c17574a [PATCH] schedule_timeout_[un]interruptible() speedup
These functions don't need schedule_timeout()'s barrier.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-13 08:22:29 -07:00
Andi Kleen
3f74478b5f [PATCH] x86-64: Some cleanup and optimization to the processor data area.
- Remove unused irqrsp field
- Remove pda->me
- Optimize set_softirq_pending slightly

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-12 10:49:58 -07:00
Paul Jackson
b3426599af [PATCH] cpuset semaphore depth check optimize
Optimize the deadlock avoidance check on the global cpuset
semaphore cpuset_sem.  Instead of adding a depth counter to the
task struct of each task, rather just two words are enough, one
to store the depth and the other the current cpuset_sem holder.

Thanks to Nikita Danilov for the idea.

Signed-off-by: Paul Jackson <pj@sgi.com>

[ We may want to change this further, but at least it's now
  a totally internal decision to the cpusets code ]

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-12 09:16:27 -07:00
Linus Torvalds
1df5c10a5b Mark ia64-specific MCA/INIT scheduler hooks as dangerous
..and only enable them for ia64. The functions are only valid
when the whole system has been totally stopped and no scheduler
activity is ongoing on any CPU, and interrupts are globally
disabled.

In other words, they aren't useful for anything else. So make
sure that nobody can use them by mistake.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-12 07:59:21 -07:00
Keith Owens
a2a979821b [PATCH] MCA/INIT: scheduler hooks
Scheduler hooks to see/change which process is deemed to be on a cpu.

Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-09-11 14:01:30 -07:00
Nishanth Aravamudan
75bcc8c5e1 [PATCH] kernel: fix-up schedule_timeout() usage
Use schedule_timeout_{,un}interruptible() instead of
set_current_state()/schedule_timeout() to reduce kernel size.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:37 -07:00
Nishanth Aravamudan
64ed93a268 [PATCH] add schedule_timeout_{,un}interruptible() interfaces
Add schedule_timeout_{,un}interruptible() interfaces so that
schedule_timeout() callers don't have to worry about forgetting to add the
set_current_state() call beforehand.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:36 -07:00
Randy Dunlap
417ef53141 [PATCH] kernel/acct: add kerneldoc
for kernel/acct.c:
- fix typos
- add kerneldoc for non-static functions

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:26 -07:00
Siddha, Suresh B
0c117f1b4d [PATCH] sched: allow the load to grow upto its cpu_power
Don't pull tasks from a group if that would cause the group's total load to
drop below its total cpu_power (ie.  cause the group to start going idle).

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:24 -07:00
Siddha, Suresh B
fa3b6ddc3f [PATCH] sched: don't kick ALB in the presence of pinned task
Jack Steiner brought this issue at my OLS talk.

Take a scenario where two tasks are pinned to two HT threads in a physical
package.  Idle packages in the system will keep kicking migration_thread on
the busy package with out any success.

We will run into similar scenarios in the presence of CMP/NUMA.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:24 -07:00
Renaud Lienhart
5927ad78ec [PATCH] sched: use cached variable in sys_sched_yield()
In sys_sched_yield(), we cache current->array in the "array" variable, thus
there's no need to dereference "current" again later.

Signed-Off-By: Renaud Lienhart <renaud.lienhart@free.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:23 -07:00
Nick Piggin
5969fe0618 [PATCH] sched: HT optimisation
If an idle sibling of an HT queue encounters a busy sibling, then make
higher level load balancing of the non-idle variety.

Performance of multiprocessor HT systems with low numbers of tasks
(generally < number of virtual CPUs) can be significantly worse than the
exact same workloads when running in non-HT mode.  The reason is largely
due to poor scheduling behaviour.

This patch improves the situation, making the performance gap far less
significant on one problematic test case (tbench).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:23 -07:00
Nick Piggin
e17224bf1d [PATCH] sched: less locking
During periodic load balancing, don't hold this runqueue's lock while
scanning remote runqueues, which can take a non trivial amount of time
especially on very large systems.

Holding the runqueue lock will only help to stabilise ->nr_running, however
this doesn't do much to help because tasks being woken will simply get held
up on the runqueue lock, so ->nr_running would not provide a really
accurate picture of runqueue load in that case anyway.

What's more, ->nr_running (and possibly the cpu_load averages) of remote
runqueues won't be stable anyway, so load balancing is always an inexact
operation.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:23 -07:00
Nick Piggin
d6d5cfaf45 [PATCH] sched: less newidle locking
Similarly to the earlier change in load_balance, only lock the runqueue in
load_balance_newidle if the busiest queue found has a nr_running > 1.  This
will reduce frequency of expensive remote runqueue lock aquisitions in the
schedule() path on some workloads.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:23 -07:00
Ingo Molnar
67f9a619e7 [PATCH] sched: fix SMT scheduler latency bug
William Weston reported unusually high scheduling latencies on his x86 HT
box, on the -RT kernel.  I managed to reproduce it on my HT box and the
latency tracer shows the incident in action:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
      du-2803  3Dnh2    0us : __trace_start_sched_wakeup (try_to_wake_up)
        ..............................................................
        ... we are running on CPU#3, PID 2778 gets woken to CPU#1: ...
        ..............................................................
      du-2803  3Dnh2    0us : __trace_start_sched_wakeup <<...>-2778> (73 1)
      du-2803  3Dnh2    0us : _raw_spin_unlock (try_to_wake_up)
        ................................................
        ... still on CPU#3, we send an IPI to CPU#1: ...
        ................................................
      du-2803  3Dnh1    0us : resched_task (try_to_wake_up)
      du-2803  3Dnh1    1us : smp_send_reschedule (try_to_wake_up)
      du-2803  3Dnh1    1us : send_IPI_mask_bitmask (smp_send_reschedule)
      du-2803  3Dnh1    2us : _raw_spin_unlock_irqrestore (try_to_wake_up)
        ...............................................
        ... 1 usec later, the IPI arrives on CPU#1: ...
        ...............................................
  <idle>-0     1Dnh.    2us : smp_reschedule_interrupt (c0100c5a 0 0)

So far so good, this is the normal wakeup/preemption mechanism.  But here
comes the scheduler anomaly on CPU#1:

  <idle>-0     1Dnh.    2us : preempt_schedule_irq (need_resched)
  <idle>-0     1Dnh.    2us : preempt_schedule_irq (need_resched)
  <idle>-0     1Dnh.    3us : __schedule (preempt_schedule_irq)
  <idle>-0     1Dnh.    3us : profile_hit (__schedule)
  <idle>-0     1Dnh1    3us : sched_clock (__schedule)
  <idle>-0     1Dnh1    4us : _raw_spin_lock_irq (__schedule)
  <idle>-0     1Dnh1    4us : _raw_spin_lock_irqsave (__schedule)
  <idle>-0     1Dnh2    5us : _raw_spin_unlock (__schedule)
  <idle>-0     1Dnh1    5us : preempt_schedule (__schedule)
  <idle>-0     1Dnh1    6us : _raw_spin_lock (__schedule)
  <idle>-0     1Dnh2    6us : find_next_bit (__schedule)
  <idle>-0     1Dnh2    6us : _raw_spin_lock (__schedule)
  <idle>-0     1Dnh3    7us : find_next_bit (__schedule)
  <idle>-0     1Dnh3    7us : find_next_bit (__schedule)
  <idle>-0     1Dnh3    8us : _raw_spin_unlock (__schedule)
  <idle>-0     1Dnh2    8us : preempt_schedule (__schedule)
  <idle>-0     1Dnh2    8us : find_next_bit (__schedule)
  <idle>-0     1Dnh2    9us : trace_stop_sched_switched (__schedule)
  <idle>-0     1Dnh2    9us : _raw_spin_lock (trace_stop_sched_switched)
  <idle>-0     1Dnh3   10us : trace_stop_sched_switched <<...>-2778> (73 8c)
  <idle>-0     1Dnh3   10us : _raw_spin_unlock (trace_stop_sched_switched)
  <idle>-0     1Dnh1   10us : _raw_spin_unlock (__schedule)
  <idle>-0     1Dnh.   11us : local_irq_enable_noresched (preempt_schedule_irq)
  <idle>-0     1Dnh.   11us < (0)

we didnt pick up pid 2778! It only gets scheduled much later:

   <...>-2778  1Dnh2  412us : __switch_to (__schedule)
   <...>-2778  1Dnh2  413us : __schedule <<idle>-0> (8c 73)
   <...>-2778  1Dnh2  413us : _raw_spin_unlock (__schedule)
   <...>-2778  1Dnh1  413us : trace_stop_sched_switched (__schedule)
   <...>-2778  1Dnh1  414us : _raw_spin_lock (trace_stop_sched_switched)
   <...>-2778  1Dnh2  414us : trace_stop_sched_switched <<...>-2778> (73 1)
   <...>-2778  1Dnh2  414us : _raw_spin_unlock (trace_stop_sched_switched)
   <...>-2778  1Dnh1  415us : trace_stop_sched_switched (__schedule)

the reason for this anomaly is the following code in dependent_sleeper():

                /*
                 * If a user task with lower static priority than the
                 * running task on the SMT sibling is trying to schedule,
                 * delay it till there is proportionately less timeslice
                 * left of the sibling task to prevent a lower priority
                 * task from using an unfair proportion of the
                 * physical cpu's resources. -ck
                 */
[...]
                        if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
                                100) > task_timeslice(p)))
                                        ret = 1;

Note that in contrast to the comment above, we dont actually do the check
based on static priority, we do the check based on timeslices.  But
timeslices go up and down, and even highprio tasks can randomly have very
low timeslices (just before their next refill) and can thus be judged as
'lowprio' by the above piece of code.  This condition is clearly buggy.
The correct test is to check for static_prio _and_ to check for the
preemption priority.  Even on different static priority levels, a
higher-prio interactive task should not be delayed due to a
higher-static-prio CPU hog.

There is a symmetric bug in the 'kick SMT sibling' code of this function as
well, which can be solved in a similar way.

The patch below (against the current scheduler queue in -mm) fixes both
bugs.  I have build and boot-tested this on x86 SMT, and nice +20 tasks
still get properly throttled - so the dependent-sleeper logic is still in
action.

btw., these bugs pessimised the SMT scheduler because the 'delay wakeup'
property was applied too liberally, so this fix is likely a throughput
improvement as well.

I separated out a smt_slice() function to make the code easier to read.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:23 -07:00
Ingo Molnar
d79fc0fc66 [PATCH] sched: TASK_NONINTERACTIVE
This patch implements a task state bit (TASK_NONINTERACTIVE), which can be
used by blocking points to mark the task's wait as "non-interactive".  This
does not mean the task will be considered a CPU-hog - the wait will simply
not have an effect on the waiting task's priority - positive or negative
alike.  Right now only pipe_wait() will make use of it, because it's a
common source of not-so-interactive waits (kernel compilation jobs, etc.).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:22 -07:00
Ingo Molnar
95cdf3b799 [PATCH] sched cleanups
whitespace cleanups.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:22 -07:00
M.Baris Demiray
da5a552270 [PATCH] sched: make idlest_group/cpu cpus_allowed-aware
Add relevant checks into find_idlest_group() and find_idlest_cpu() to make
them return only the groups that have allowed CPUs and allowed CPUs
respectively.

Signed-off-by: M.Baris Demiray <baris@labristeknoloji.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:22 -07:00
Con Kolivas
fc38ed7531 [PATCH] sched: run SCHED_NORMAL tasks with real time tasks on SMT siblings
The hyperthread aware nice handling currently puts to sleep any non real
time task when a real time task is running on its sibling cpu.  This can
lead to prolonged starvation by having the non real time task pegged to the
cpu with load balancing not pulling that task away.

Currently we force lower priority hyperthread tasks to run a percentage of
time difference based on timeslice differences which is meaningless when
comparing real time tasks to SCHED_NORMAL tasks.  We can allow non real
time tasks to run with real time tasks on the sibling up to per_cpu_gain%
if we use jiffies as a counter.

Cleanups and micro-optimisations to the relevant code section should make
it more understandable as well.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:22 -07:00
Paul Jackson
4247bdc600 [PATCH] cpuset semaphore depth check deadlock fix
The cpusets-formalize-intermediate-gfp_kernel-containment patch
has a deadlock problem.

This patch was part of a set of four patches to make more
extensive use of the cpuset 'mem_exclusive' attribute to
manage kernel GFP_KERNEL memory allocations and to constrain
the out-of-memory (oom) killer.

A task that is changing cpusets in particular ways on a system
when it is very short of free memory could double trip over
the global cpuset_sem semaphore (get the lock and then deadlock
trying to get it again).

The second attempt to get cpuset_sem would be in the routine
cpuset_zone_allowed().  This was discovered by code inspection.
I can not reproduce the problem except with an artifically
hacked kernel and a specialized stress test.

In real life you cannot hit this unless you are manipulating
cpusets, and are very unlikely to hit it unless you are rapidly
modifying cpusets on a memory tight system.  Even then it would
be a rare occurence.

If you did hit it, the task double tripping over cpuset_sem
would deadlock in the kernel, and any other task also trying
to manipulate cpusets would deadlock there too, on cpuset_sem.
Your batch manager would be wedged solid (if it was cpuset
savvy), but classic Unix shells and utilities would work well
enough to reboot the system.

The unusual condition that led to this bug is that unlike most
semaphores, cpuset_sem _can_ be acquired while in the page
allocation code, when __alloc_pages() calls cpuset_zone_allowed.
So it easy to mistakenly perform the following sequence:
  1) task makes system call to alter a cpuset
  2) take cpuset_sem
  3) try to allocate memory
  4) memory allocator, via cpuset_zone_allowed, trys to take cpuset_sem
  5) deadlock

The reason that this is not a serious bug for most users
is that almost all calls to allocate memory don't require
taking cpuset_sem.  Only some code paths off the beaten
track require taking cpuset_sem -- which is good.  Taking
a global semaphore on the main code path for allocating
memory would not scale well.

This patch fixes this deadlock by wrapping the up() and down()
calls on cpuset_sem in kernel/cpuset.c with code that tracks
the nesting depth of the current task on that semaphore, and
only does the real down() if the task doesn't hold the lock
already, and only does the real up() if the nesting depth
(number of unmatched downs) is exactly one.

The previous required use of refresh_mems(), anytime that
the cpuset_sem semaphore was acquired and the code executed
while holding that semaphore might try to allocate memory, is
no longer required.  Two refresh_mems() calls were removed
thanks to this.  This is a good change, as failing to get
all the necessary refresh_mems() calls placed was a primary
source of bugs in this cpuset code.  The only remaining call
to refresh_mems() is made while doing a memory allocation,
if certain task memory placement data needs to be updated
from its cpuset, due to the cpuset having been changed behind
the tasks back.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:21 -07:00
Ingo Molnar
fb1c8f93d8 [PATCH] spinlock consolidation
This patch (written by me and also containing many suggestions of Arjan van
de Ven) does a major cleanup of the spinlock code.  It does the following
things:

 - consolidates and enhances the spinlock/rwlock debugging code

 - simplifies the asm/spinlock.h files

 - encapsulates the raw spinlock type and moves generic spinlock
   features (such as ->break_lock) into the generic code.

 - cleans up the spinlock code hierarchy to get rid of the spaghetti.

Most notably there's now only a single variant of the debugging code,
located in lib/spinlock_debug.c.  (previously we had one SMP debugging
variant per architecture, plus a separate generic one for UP builds)

Also, i've enhanced the rwlock debugging facility, it will now track
write-owners.  There is new spinlock-owner/CPU-tracking on SMP builds too.
All locks have lockup detection now, which will work for both soft and hard
spin/rwlock lockups.

The arch-level include files now only contain the minimally necessary
subset of the spinlock code - all the rest that can be generalized now
lives in the generic headers:

 include/asm-i386/spinlock_types.h       |   16
 include/asm-x86_64/spinlock_types.h     |   16

I have also split up the various spinlock variants into separate files,
making it easier to see which does what. The new layout is:

   SMP                         |  UP
   ----------------------------|-----------------------------------
   asm/spinlock_types_smp.h    |  linux/spinlock_types_up.h
   linux/spinlock_types.h      |  linux/spinlock_types.h
   asm/spinlock_smp.h          |  linux/spinlock_up.h
   linux/spinlock_api_smp.h    |  linux/spinlock_api_up.h
   linux/spinlock.h            |  linux/spinlock.h

/*
 * here's the role of the various spinlock/rwlock related include files:
 *
 * on SMP builds:
 *
 *  asm/spinlock_types.h: contains the raw_spinlock_t/raw_rwlock_t and the
 *                        initializers
 *
 *  linux/spinlock_types.h:
 *                        defines the generic type and initializers
 *
 *  asm/spinlock.h:       contains the __raw_spin_*()/etc. lowlevel
 *                        implementations, mostly inline assembly code
 *
 *   (also included on UP-debug builds:)
 *
 *  linux/spinlock_api_smp.h:
 *                        contains the prototypes for the _spin_*() APIs.
 *
 *  linux/spinlock.h:     builds the final spin_*() APIs.
 *
 * on UP builds:
 *
 *  linux/spinlock_type_up.h:
 *                        contains the generic, simplified UP spinlock type.
 *                        (which is an empty structure on non-debug builds)
 *
 *  linux/spinlock_types.h:
 *                        defines the generic type and initializers
 *
 *  linux/spinlock_up.h:
 *                        contains the __raw_spin_*()/etc. version of UP
 *                        builds. (which are NOPs on non-debug, non-preempt
 *                        builds)
 *
 *   (included on UP-non-debug builds:)
 *
 *  linux/spinlock_api_up.h:
 *                        builds the _spin_*() APIs.
 *
 *  linux/spinlock.h:     builds the final spin_*() APIs.
 */

All SMP and UP architectures are converted by this patch.

arm, i386, ia64, ppc, ppc64, s390/s390x, x64 was build-tested via
crosscompilers.  m32r, mips, sh, sparc, have not been tested yet, but should
be mostly fine.

From: Grant Grundler <grundler@parisc-linux.org>

  Booted and lightly tested on a500-44 (64-bit, SMP kernel, dual CPU).
  Builds 32-bit SMP kernel (not booted or tested).  I did not try to build
  non-SMP kernels.  That should be trivial to fix up later if necessary.

  I converted bit ops atomic_hash lock to raw_spinlock_t.  Doing so avoids
  some ugly nesting of linux/*.h and asm/*.h files.  Those particular locks
  are well tested and contained entirely inside arch specific code.  I do NOT
  expect any new issues to arise with them.

 If someone does ever need to use debug/metrics with them, then they will
  need to unravel this hairball between spinlocks, atomic ops, and bit ops
  that exist only because parisc has exactly one atomic instruction: LDCW
  (load and clear word).

From: "Luck, Tony" <tony.luck@intel.com>

   ia64 fix

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjanv@infradead.org>
Signed-off-by: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se>
Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:21 -07:00
Dipankar Sarma
ab2af1f500 [PATCH] files: files struct with RCU
Patch to eliminate struct files_struct.file_lock spinlock on the reader side
and use rcu refcounting rcuref_xxx api for the f_count refcounter.  The
updates to the fdtable are done by allocating a new fdtable structure and
setting files->fdt to point to the new structure.  The fdtable structure is
protected by RCU thereby allowing lock-free lookup.  For fd arrays/sets that
are vmalloced, we use keventd to free them since RCU callbacks can't sleep.  A
global list of fdtable to be freed is not scalable, so we use a per-cpu list.
If keventd is already handling the current cpu's work, we use a timer to defer
queueing of that work.

Since the last publication, this patch has been re-written to avoid using
explicit memory barriers and use rcu_assign_pointer(), rcu_dereference()
premitives instead.  This required that the fd information is kept in a
separate structure (fdtable) and updated atomically.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:57:55 -07:00
Dipankar Sarma
badf16621c [PATCH] files: break up files struct
In order for the RCU to work, the file table array, sets and their sizes must
be updated atomically.  Instead of ensuring this through too many memory
barriers, we put the arrays and their sizes in a separate structure.  This
patch takes the first step of putting the file table elements in a separate
structure fdtable that is embedded withing files_struct.  It also changes all
the users to refer to the file table using files_fdtable() macro.  Subsequent
applciation of RCU becomes easier after this.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:57:55 -07:00
Dipankar Sarma
c0dfb29051 [PATCH] files: rcuref APIs
Adds a set of primitives to do reference counting for objects that are looked
up without locks using RCU.

Signed-off-by: Ravikiran Thirumalai <kiran_th@gmail.com>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:57:54 -07:00
KUROSAWA Takahiro
73a358d189 [PATCH] fix for cpusets minor problem
This patch fixes minor problem that the CPUSETS have when files in the
cpuset filesystem are read after lseek()-ed beyond the EOF.

Signed-off-by: KUROSAWA Takahiro <kurosawa@valinux.co.jp>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:57:32 -07:00
Chen, Kenneth W
383f2835eb [PATCH] Prefetch kernel stacks to speed up context switch
For architecture like ia64, the switch stack structure is fairly large
(currently 528 bytes).  For context switch intensive application, we found
that significant amount of cache misses occurs in switch_to() function.
The following patch adds a hook in the schedule() function to prefetch
switch stack structure as soon as 'next' task is determined.  This allows
maximum overlap in prefetch cache lines for that structure.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:57:31 -07:00
Jason Baron
b0d62e6d5b [PATCH] fix disassociate_ctty vs. fork race
Race is as follows. Process A forks process B, both being part of the same
session. Then, A calls disassociate_ctty while B forks C:

A				B
====				====
				fork()
				  copy_signal()
dissasociate_ctty()		....
				  attach_pid(p, PIDTYPE_SID, p->signal->session);

Now, C can have current->signal->tty pointing to a freed tty structure, as
it hasn't yet been added to the session group (to have its controlling tty
cleared on the diassociate_ctty() call).

This has shown up as an oops but could be even more serious.  I haven't
tried to create a test case, but a customer has verified that the patch
below resolves the issue, which was occuring quite frequently.  I'll try
and post the test case if i can.

The patch simply checks for a NULL tty *after* it has been attached to the
proper session group and clears it as necessary.  Alternatively, we could
simply do the tty assignment after the the process is added to the proper
session group.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:57:31 -07:00
Giancarlo Formicuccia
4b5d37ac02 [PATCH] Clear task_struct->fs_excl on fork()
An oversight.  We don't want to carry the IO scheduler's "we hold exclusive fs
resources" hint over to the child across fork().

Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:56:43 -07:00
Len Brown
64e47488c9 Merge linux-2.6 with linux-acpi-2.6 2005-09-08 01:45:47 -04:00
Linus Torvalds
0dd7f883a9 Merge branch 'upstream' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6 2005-09-07 17:28:25 -07:00
Keshavamurthy Anil S
deac66ae45 [PATCH] kprobes: fix bug when probed on task and isr functions
This patch fixes a race condition where in system used to hang or sometime
crash within minutes when kprobes are inserted on ISR routine and a task
routine.

The fix has been stress tested on i386, ia64, pp64 and on x86_64.  To
reproduce the problem insert kprobes on schedule() and do_IRQ() functions
and you should see hang or system crash.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:58:01 -07:00
Prasanna S Panchamukhi
d0aaff9796 [PATCH] Kprobes: prevent possible race conditions generic
There are possible race conditions if probes are placed on routines within the
kprobes files and routines used by the kprobes.  For example if you put probe
on get_kprobe() routines, the system can hang while inserting probes on any
routine such as do_fork().  Because while inserting probes on do_fork(),
register_kprobes() routine grabs the kprobes spin lock and executes
get_kprobe() routine and to handle probe of get_kprobe(), kprobes_handler()
gets executed and tries to grab kprobes spin lock, and spins forever.  This
patch avoids such possible race conditions by preventing probes on routines
within the kprobes file and routines used by kprobes.

I have modified the patches as per Andi Kleen's suggestion to move kprobes
routines and other routines used by kprobes to a seperate section
.kprobes.text.

Also moved page fault and exception handlers, general protection fault to
.kprobes.text section.

These patches have been tested on i386, x86_64 and ppc64 architectures, also
compiled on ia64 and sparc64 architectures.

Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:59 -07:00
Pekka J Enberg
dd3927105b [PATCH] introduce and use kzalloc
This patch introduces a kzalloc wrapper and converts kernel/ to use it.  It
saves a little program text.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:45 -07:00
Miklos Szeredi
ab8d11beb4 [PATCH] remove duplicated code from proc and ptrace
Extract common code used by ptrace_attach() and may_ptrace_attach()
into a separate function.

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:43 -07:00
John Hawkes
0811bab24f [PATCH] cpusets: re-enable "dynamic sched domains"
Revert the hack introduced last week.

Signed-off-by: John Hawkes <hawkes@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:41 -07:00
John Hawkes
d1b551386a [PATCH] cpusets: fix the "dynamic sched domains" bug
For a NUMA system with multiple CPUs per node, declaring a cpu-exclusive
cpuset that includes only some, but not all, of the CPUs in a node will mangle
the sched domain structures.

Signed-off-by: John Hawkes <hawkes@sgi.com>
Cc; Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:41 -07:00
John Hawkes
9c1cfda20a [PATCH] cpusets: Move the ia64 domain setup code to the generic code
Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:40 -07:00
Paul Jackson
ef08e3b498 [PATCH] cpusets: confine oom_killer to mem_exclusive cpuset
Now the real motivation for this cpuset mem_exclusive patch series seems
trivial.

This patch keeps a task in or under one mem_exclusive cpuset from provoking an
oom kill of a task under a non-overlapping mem_exclusive cpuset.  Since only
interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
containment, there is little to gain from oom killing a task under a
non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
allocation must come from disjoint memory nodes.

This patch enables configuring a system so that a runaway job under one
mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
that might be using very high compute and memory resources for a prolonged
time.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:40 -07:00
Paul Jackson
9bf2229f88 [PATCH] cpusets: formalize intermediate GFP_KERNEL containment
This patch makes use of the previously underutilized cpuset flag
'mem_exclusive' to provide what amounts to another layer of memory placement
resolution.  With this patch, there are now the following four layers of
memory placement available:

 1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
 2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
 3) The current tasks cpuset (GFP_USER allocations constrained to here), and
 4) Specific node placement, using mbind and set_mempolicy.

These nest - each layer is a subset (same or within) of the previous.

Layer (2) above is new, with this patch.  The call used to check whether a
zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
extended to take a gfp_mask argument, and its logic is extended, in the case
that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
placement is allowed.  The definition of GFP_USER, which used to be identical
to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
cpuset_gfp_hardwall_flag patch.

GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks
cpuset, so long as any node therein is not too tight on memory, but will
escape to the larger layer, if need be.

The intended use is to allow something like a batch manager to handle several
jobs, each job in its own cpuset, but using common kernel memory for caches
and such.  Swapper and oom_kill activity is also constrained to Layer (2).  A
task in or below one mem_exclusive cpuset should not cause swapping on nodes
in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
task in another such cpuset.  Heavy use of kernel memory for i/o caching and
such by one job should not impact the memory available to jobs in other
non-overlapping mem_exclusive cpusets.

This patch enables providing hardwall, inescapable cpusets for memory
allocations of each job, while sharing kernel memory allocations between
several jobs, in an enclosing mem_exclusive cpuset.

Like Dinakar's patch earlier to enable administering sched domains using the
cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
that had previously done nothing much useful other than restrict what cpuset
configurations were allowed.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:40 -07:00
Pekka Enberg
39ed3fdeec [PATCH] futex: remove duplicate code
This patch cleans up the error path of futex_fd() by removing duplicate
code.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:33 -07:00
Oleg Nesterov
e752dd6cc6 [PATCH] fix send_sigqueue() vs thread exit race
posix_timer_event() first checks that the thread (SIGEV_THREAD_ID case)
does not have PF_EXITING flag, then it calls send_sigqueue() which locks
task list.  But if the thread exits in between the kernel will oops
(->sighand == NULL after __exit_sighand).

This patch moves the PF_EXITING check into the send_sigqueue(), it must be
done atomically under tasklist_lock.  When send_sigqueue() detects exiting
thread it returns -1.  In that case posix_timer_event will send the signal
to thread group.

Also, this patch fixes task_struct use-after-free in posix_timer_event.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:33 -07:00
Jesper Juhl
0730ded5be [PATCH] remove a redundant variable in sys_prctl()
The patch removes a redundant variable `sig' from sys_prctl().

For some reason, when sys_prctl is called with option == PR_SET_PDEATHSIG
then the value of arg2 is assigned to an int variable named sig.  Then sig
is tested with valid_signal() and later used to set the value of
current->pdeath_signal .

There is no reason to use this intermediate variable since valid_signal()
takes a unsigned long argument, so it can handle being passed arg2
directly, and if the call to valid_signal is OK, then we know the value of
arg2 is in the range zero to _NSIG and thus it'll easily fit in a plain int
and thus there's no problem assigning it later to current->pdeath_signal
(which is an int).

The patch gets rid of the pointless variable `sig'.
This reduces the size of kernel/sys.o in 2.6.13-rc6-mm1 by 32 bytes on my
system.

Patch has been compile tested, boot tested, and just to make damn sure I
didn't break anything I wrote a quick test app that calls
prctl(PR_SET_PDEATHSIG ...) with the entire range of values for a
unsigned long, and it behaves as expected with and without the patch.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:32 -07:00
Peter Staubach
6c9c0b52b8 [PATCH] largefile support for accounting
There is a problem in the accounting subsystem in the kernel can not
correctly handle files larger than 2GB.  The output file containing the
process accounting data can grow very large if the system is large enough
and active enough.  If the 2GB limit is reached, then the system simply
stops storing process accounting data.

Another annoying problem is that once the system reaches this 2GB limit,
then every process which exits will receive a signal, SIGXFSZ.  This signal
is generated because an attempt was made to write beyond the limit for the
file descriptor.  This signal makes it look like every process has exited
due to a signal, when in fact, they have not.

The solution is to add the O_LARGEFILE flag to the list of flags used to
open the accounting file.  The rest of the accounting support is already
largefile safe.

The changes were tested by constructing a large file (just short of 2GB),
enabling accounting, and then running enough commands to cause the
accounting data generated to increase the size of the file to 2GB.  Without
the changes, the file grows to 2GB and the last command run in the test
script appears to exit due a signal when it has not.  With the changes,
things work as expected and quietly.

There are some user level changes required so that it can deal with
largefiles, but those are being handled separately.

Signed-off-by: Peter Staubach <staubach@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:31 -07:00
Oleg Nesterov
bc505a478d [PATCH] do_notify_parent_cldstop() cleanup
This patch simplifies the usage of do_notify_parent_cldstop(), it lessens
the source and .text size slightly, and makes the code (in my opinion) a
bit more readable.

I am sending this patch now because I'm afraid Paul will touch
do_notify_parent_cldstop() really soon, It's better to cleanup first.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:31 -07:00
Karsten Wiese
f26fdd5992 [PATCH] CHECK_IRQ_PER_CPU() to avoid dead code in __do_IRQ()
IRQ_PER_CPU is not used by all architectures.  This patch introduces the
macros ARCH_HAS_IRQ_PER_CPU and CHECK_IRQ_PER_CPU() to avoid the generation
of dead code in __do_IRQ().

ARCH_HAS_IRQ_PER_CPU is defined by architectures using IRQ_PER_CPU in their
include/asm_ARCH/irq.h file.

Through grepping the tree I found the following architectures currently use
IRQ_PER_CPU:

        cris, ia64, ppc, ppc64 and parisc.

Signed-off-by: Karsten Wiese <annabellesgarden@yahoo.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:29 -07:00
Mika Kukkonen
230649da7c [PATCH] create_workqueue_thread() signedness fix
With "-W -Wno-unused -Wno-sign-compare" I get the following compile warning:

  CC      kernel/workqueue.o
kernel/workqueue.c: In function `workqueue_cpu_callback':
kernel/workqueue.c:504: warning: ordered comparison of pointer with integer zero

On error create_workqueue_thread() returns NULL, not negative pointer, so
following trivial patch suggests itself.

Signed-off-by: Mika Kukkonen <mikukkon@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:28 -07:00
Thomas Koeller
378bac820b [PATCH] flush icache early when loading module
Change the sequence of operations performed during module loading to flush
the instruction cache before module parameters are processed.  If a module
has parameters of an unusual type that cannot be handled using the standard
accessor functions param_set_xxx and param_get_xxx, it has to to provide a
set of accessor functions for this type.  This requires module code to be
executed during parameter processing, which is of course only possible
after the icache has been flushed.

Signed-off-by: Thomas Koeller <thomas@koeller.dyndns.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:26 -07:00
Alex Williamson
486d46aefe [PATCH] optimize writer path in time_interpolator_get_counter()
Christoph Lameter <clameter@engr.sgi.com>

When using a time interpolator that is susceptible to jitter there's
potentially contention over a cmpxchg used to prevent time from going
backwards.  This is unnecessary when the caller holds the xtime write
seqlock as all readers will be blocked from returning until the write is
complete.  We can therefore allow writers to insert a new value and exit
rather than fight with CPUs who only hold a reader lock.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:24 -07:00
David Howells
fe21773d65 [PATCH] Provide better printk() support for SMP machines
The attached patch prevents oopses interleaving with characters from
other printks on other CPUs by only breaking the lock if the oops is
happening on the machine holding the lock.

It might be better if the oops generator got the lock and then called an
inner vprintk routine that assumed the caller holds the lock, thus
making oops reports "atomic".

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:18 -07:00
Ingo Molnar
8446f1d391 [PATCH] detect soft lockups
This patch adds a new kernel debug feature: CONFIG_DETECT_SOFTLOCKUP.

When enabled then per-CPU watchdog threads are started, which try to run
once per second.  If they get delayed for more than 10 seconds then a
callback from the timer interrupt detects this condition and prints out a
warning message and a stack dump (once per lockup incident).  The feature
is otherwise non-intrusive, it doesnt try to unlock the box in any way, it
only gets the debug info out, automatically, and on all CPUs affected by
the lockup.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-Off-By: Matthias Urlichs <smurf@smurf.noris.de>
Signed-off-by: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:17 -07:00
Jakub Jelinek
4732efbeb9 [PATCH] FUTEX_WAKE_OP: pthread_cond_signal() speedup
ATM pthread_cond_signal is unnecessarily slow, because it wakes one waiter
(which at least on UP usually means an immediate context switch to one of
the waiter threads).  This waiter wakes up and after a few instructions it
attempts to acquire the cv internal lock, but that lock is still held by
the thread calling pthread_cond_signal.  So it goes to sleep and eventually
the signalling thread is scheduled in, unlocks the internal lock and wakes
the waiter again.

Now, before 2003-09-21 NPTL was using FUTEX_REQUEUE in pthread_cond_signal
to avoid this performance issue, but it was removed when locks were
redesigned to the 3 state scheme (unlocked, locked uncontended, locked
contended).

Following scenario shows why simply using FUTEX_REQUEUE in
pthread_cond_signal together with using lll_mutex_unlock_force in place of
lll_mutex_unlock is not enough and probably why it has been disabled at
that time:

The number is value in cv->__data.__lock.
        thr1            thr2            thr3
0       pthread_cond_wait
1       lll_mutex_lock (cv->__data.__lock)
0       lll_mutex_unlock (cv->__data.__lock)
0       lll_futex_wait (&cv->__data.__futex, futexval)
0                       pthread_cond_signal
1                       lll_mutex_lock (cv->__data.__lock)
1                                       pthread_cond_signal
2                                       lll_mutex_lock (cv->__data.__lock)
2                                         lll_futex_wait (&cv->__data.__lock, 2)
2                       lll_futex_requeue (&cv->__data.__futex, 0, 1, &cv->__data.__lock)
                          # FUTEX_REQUEUE, not FUTEX_CMP_REQUEUE
2                       lll_mutex_unlock_force (cv->__data.__lock)
0                         cv->__data.__lock = 0
0                         lll_futex_wake (&cv->__data.__lock, 1)
1       lll_mutex_lock (cv->__data.__lock)
0       lll_mutex_unlock (cv->__data.__lock)
          # Here, lll_mutex_unlock doesn't know there are threads waiting
          # on the internal cv's lock

Now, I believe it is possible to use FUTEX_REQUEUE in pthread_cond_signal,
but it will cost us not one, but 2 extra syscalls and, what's worse, one of
these extra syscalls will be done for every single waiting loop in
pthread_cond_*wait.

We would need to use lll_mutex_unlock_force in pthread_cond_signal after
requeue and lll_mutex_cond_lock in pthread_cond_*wait after lll_futex_wait.

Another alternative is to do the unlocking pthread_cond_signal needs to do
(the lock can't be unlocked before lll_futex_wake, as that is racy) in the
kernel.

I have implemented both variants, futex-requeue-glibc.patch is the first
one and futex-wake_op{,-glibc}.patch is the unlocking inside of the kernel.
 The kernel interface allows userland to specify how exactly an unlocking
operation should look like (some atomic arithmetic operation with optional
constant argument and comparison of the previous futex value with another
constant).

It has been implemented just for ppc*, x86_64 and i?86, for other
architectures I'm including just a stub header which can be used as a
starting point by maintainers to write support for their arches and ATM
will just return -ENOSYS for FUTEX_WAKE_OP.  The requeue patch has been
(lightly) tested just on x86_64, the wake_op patch on ppc64 kernel running
32-bit and 64-bit NPTL and x86_64 kernel running 32-bit and 64-bit NPTL.

With the following benchmark on UP x86-64 I get:

for i in nptl-orig nptl-requeue nptl-wake_op; do echo time elf/ld.so --library-path .:$i /tmp/bench; \
for j in 1 2; do echo ( time elf/ld.so --library-path .:$i /tmp/bench ) 2>&1; done; done
time elf/ld.so --library-path .:nptl-orig /tmp/bench
real 0m0.655s user 0m0.253s sys 0m0.403s
real 0m0.657s user 0m0.269s sys 0m0.388s
time elf/ld.so --library-path .:nptl-requeue /tmp/bench
real 0m0.496s user 0m0.225s sys 0m0.271s
real 0m0.531s user 0m0.242s sys 0m0.288s
time elf/ld.so --library-path .:nptl-wake_op /tmp/bench
real 0m0.380s user 0m0.176s sys 0m0.204s
real 0m0.382s user 0m0.175s sys 0m0.207s

The benchmark is at:
http://sourceware.org/ml/libc-alpha/2005-03/txt00001.txt
Older futex-requeue-glibc.patch version is at:
http://sourceware.org/ml/libc-alpha/2005-03/txt00002.txt
Older futex-wake_op-glibc.patch version is at:
http://sourceware.org/ml/libc-alpha/2005-03/txt00003.txt
Will post a new version (just x86-64 fixes so that the patch
applies against pthread_cond_signal.S) to libc-hacker ml soon.

Attached is the kernel FUTEX_WAKE_OP patch as well as a simple-minded
testcase that will not test the atomicity of the operation, but at least
check if the threads that should have been woken up are woken up and
whether the arithmetic operation in the kernel gave the expected results.

Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Jamie Lokier <jamie@shareable.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Yoichi Yuasa <yuasa@hh.iij4u.or.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:17 -07:00
Pavel Machek
d7ae79c72d [PATCH] swsusp: update documentation
This updates documentation a bit (mostly removing obsolete stuff), and
marks swsusp as no longer experimental in config.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:16 -07:00
Ashok Raj
54d5d42404 [PATCH] x86/x86_64: deferred handling of writes to /proc/irqxx/smp_affinity
When handling writes to /proc/irq, current code is re-programming rte
entries directly. This is not recommended and could potentially cause
chipset's to lockup, or cause missing interrupts.

CONFIG_IRQ_BALANCE does this correctly, where it re-programs only when the
interrupt is pending. The same needs to be done for /proc/irq handling as well.
Otherwise user space irq balancers are really not doing the right thing.

- Changed pending_irq_balance_cpumask to pending_irq_migrate_cpumask for
  lack of a generic name.
- added move_irq out of IRQ_BALANCE, and added this same to X86_64
- Added new proc handler for write, so we can do deferred write at irq
  handling time.
- Display of /proc/irq/XX/smp_affinity used to display CPU_MASKALL, instead
  it now shows only active cpu masks, or exactly what was set.
- Provided a common move_irq implementation, instead of duplicating
  when using generic irq framework.

Tested on i386/x86_64 and ia64 with CONFIG_PCI_MSI turned on and off.
Tested UP builds as well.

MSI testing: tbd: I have cards, need to look for a x-over cable, although I
did test an earlier version of this patch.  Will test in a couple days.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Acked-by: Zwane Mwaikambo <zwane@holomorphy.com>
Grudgingly-acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Coywolf Qi Hunt <coywolf@lovecn.org>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:15 -07:00
Jeff Garzik
344babaa9d [kernel-doc] fix various DocBook build problems/warnings
Most serious is fixing include/sound/pcm.h, which breaks the DocBook
build.

The other stuff is just filling in things that cause warnings.
2005-09-07 01:15:17 -04:00
Laurent Vivier
ed75e8d580 [PATCH] UML Support - Ptrace: adds the host SYSEMU support, for UML and general usage
Jeff Dike <jdike@addtoit.com>,
      Paolo 'Blaisorblade' Giarrusso <blaisorblade_spam@yahoo.it>,
      Bodo Stroesser <bstroesser@fujitsu-siemens.com>

Adds a new ptrace(2) mode, called PTRACE_SYSEMU, resembling PTRACE_SYSCALL
except that the kernel does not execute the requested syscall; this is useful
to improve performance for virtual environments, like UML, which want to run
the syscall on their own.

In fact, using PTRACE_SYSCALL means stopping child execution twice, on entry
and on exit, and each time you also have two context switches; with SYSEMU you
avoid the 2nd stop and so save two context switches per syscall.

Also, some architectures don't have support in the host for changing the
syscall number via ptrace(), which is currently needed to skip syscall
execution (UML turns any syscall into getpid() to avoid it being executed on
the host).  Fixing that is hard, while SYSEMU is easier to implement.

* This version of the patch includes some suggestions of Jeff Dike to avoid
  adding any instructions to the syscall fast path, plus some other little
  changes, by myself, to make it work even when the syscall is executed with
  SYSENTER (but I'm unsure about them). It has been widely tested for quite a
  lot of time.

* Various fixed were included to handle the various switches between
  various states, i.e. when for instance a syscall entry is traced with one of
  PT_SYSCALL / _SYSEMU / _SINGLESTEP and another one is used on exit.
  Basically, this is done by remembering which one of them was used even after
  the call to ptrace_notify().

* We're combining TIF_SYSCALL_EMU with TIF_SYSCALL_TRACE or TIF_SINGLESTEP
  to make do_syscall_trace() notice that the current syscall was started with
  SYSEMU on entry, so that no notification ought to be done in the exit path;
  this is a bit of a hack, so this problem is solved in another way in next
  patches.

* Also, the effects of the patch:
"Ptrace - i386: fix Syscall Audit interaction with singlestep"
are cancelled; they are restored back in the last patch of this series.

Detailed descriptions of the patches doing this kind of processing follow (but
I've already summed everything up).

* Fix behaviour when changing interception kind #1.

  In do_syscall_trace(), we check the status of the TIF_SYSCALL_EMU flag
  only after doing the debugger notification; but the debugger might have
  changed the status of this flag because he continued execution with
  PTRACE_SYSCALL, so this is wrong.  This patch fixes it by saving the flag
  status before calling ptrace_notify().

* Fix behaviour when changing interception kind #2:
  avoid intercepting syscall on return when using SYSCALL again.

  A guest process switching from using PTRACE_SYSEMU to PTRACE_SYSCALL
  crashes.

  The problem is in arch/i386/kernel/entry.S.  The current SYSEMU patch
  inhibits the syscall-handler to be called, but does not prevent
  do_syscall_trace() to be called after this for syscall completion
  interception.

  The appended patch fixes this.  It reuses the flag TIF_SYSCALL_EMU to
  remember "we come from PTRACE_SYSEMU and now are in PTRACE_SYSCALL", since
  the flag is unused in the depicted situation.

* Fix behaviour when changing interception kind #3:
  avoid intercepting syscall on return when using SINGLESTEP.

  When testing 2.6.9 and the skas3.v6 patch, with my latest patch and had
  problems with singlestepping on UML in SKAS with SYSEMU.  It looped
  receiving SIGTRAPs without moving forward.  EIP of the traced process was
  the same for all SIGTRAPs.

What's missing is to handle switching from PTRACE_SYSCALL_EMU to
PTRACE_SINGLESTEP in a way very similar to what is done for the change from
PTRACE_SYSCALL_EMU to PTRACE_SYSCALL_TRACE.

I.e., after calling ptrace(PTRACE_SYSEMU), on the return path, the debugger is
notified and then wake ups the process; the syscall is executed (or skipped,
when do_syscall_trace() returns 0, i.e.  when using PTRACE_SYSEMU), and
do_syscall_trace() is called again.  Since we are on the return path of a
SYSEMU'd syscall, if the wake up is performed through ptrace(PTRACE_SYSCALL),
we must still avoid notifying the parent of the syscall exit.  Now, this
behaviour is extended even to resuming with PTRACE_SINGLESTEP.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:06:20 -07:00
Pavel Machek
57c4ce3cbf [PATCH] pm: clean up /sys/power/disk
Clean code up a bit, and only show suspend to disk as available when
it is configured in.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:06:18 -07:00
Pavel Machek
6161b2ce81 [PATCH] pm: fix process freezing
If process freezing fails, some processes are frozen, and rest are left in
"were asked to be frozen" state.  Thats wrong, we should leave it in some
consistent state.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:06:17 -07:00
Pavel Machek
99dc7d63e0 [PATCH] swsusp: fix error handling and cleanups
Drop printing during normal boot (when no image exists in swap), print
message when drivers fail, fix error paths and consolidate near-identical
functions in disk.c (and functions with just one statement).

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:06:17 -07:00
Shaohua Li
dd5d666b79 [PATCH] swsusp: add locking to software_resume
It is trying to protect swsusp_resume_device and software_resume() from two
users banging it from userspace at the same time.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:06:17 -07:00
Michal Schmidt
56057e1a12 [PATCH] swsusp: simpler calculation of number of pages in PBE list
The function calc_nr uses an iterative algorithm to calculate the number of
pages needed for the image and the pagedir.  Exactly the same result can be
obtained with a one-line expression.

Note that this was even proved correct ;-).

Signed-off-by: Michal Schmidt <xschmi00@stud.feec.vutbr.cz>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:06:17 -07:00
Andreas Steinmetz
c2ff18f407 [PATCH] encrypt suspend data for easy wiping
The patch protects from leaking sensitive data after resume from suspend.
During suspend a temporary key is created and this key is used to encrypt the
data written to disk.  When, during resume, the data was read back into memory
the temporary key is destroyed which simply means that all data written to
disk during suspend are then inaccessible so they can't be stolen lateron.

Think of the following: you suspend while an application is running that keeps
sensitive data in memory.  The application itself prevents the data from being
swapped out.  Suspend, however, must write these data to swap to be able to
resume lateron.  Without suspend encryption your sensitive data are then
stored in plaintext on disk.  This means that after resume your sensitive data
are accessible to all applications having direct access to the swap device
which was used for suspend.  If you don't need swap after resume these data
can remain on disk virtually forever.  Thus it can happen that your system
gets broken in weeks later and sensitive data which you thought were encrypted
and protected are retrieved and stolen from the swap device.

Signed-off-by: Andreas Steinmetz <ast@domdv.de>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:06:16 -07:00
Pavel Machek
2a23b5d1e1 [PATCH] remove busywait in refrigerator
This should make refrigerator sleep properly, not busywait after the first
schedule() returns.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:06:14 -07:00
Hugh Dickins
dae06ac43d [PATCH] swap: update swsusp use of swap_info
Aha, swsusp dips into swap_info[], better update it to swap_lock.  It's
bitflipping flags with 0xFF, so get_swap_page will allocate from only the one
chosen device: let's change that to flip SWP_WRITEOK.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:42 -07:00
Len Brown
129521dcc9 Merge linux-2.6 into linux-acpi-2.6 test 2005-09-03 02:44:09 -04:00
Arnaldo Carvalho de Melo
20380731bc [NET]: Fix sparse warnings
Of this type, mostly:

CHECK   net/ipv6/netfilter.c
net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static?
net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static?

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:01:32 -07:00
Patrick McHardy
066286071d [NETLINK]: Add "groups" argument to netlink_kernel_create
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 16:01:11 -07:00
Harald Welte
4fdb3bb723 [NETLINK]: Add properly module refcounting for kernel netlink sockets.
- Remove bogus code for compiling netlink as module
- Add module refcounting support for modules implementing a netlink
  protocol
- Add support for autoloading modules that implement a netlink protocol
  as soon as someone opens a socket for that protocol

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:35:08 -07:00
David Woodhouse
efda945204 Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git 2005-08-27 14:30:07 +02:00
David Woodhouse
b01f2cc1c3 [AUDIT] Allow filtering on system call success _or_ failure
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-08-27 10:25:43 +01:00
Len Brown
60cfff3516 Auto-update from upstream 2005-08-26 22:11:28 -04:00
Paul Jackson
212d6d2237 [PATCH] completely disable cpu_exclusive sched domain
At the suggestion of Nick Piggin and Dinakar, totally disable
the facility to allow cpu_exclusive cpusets to define dynamic
sched domains in Linux 2.6.13, in order to avoid problems
first reported by John Hawkes (corrupt sched data structures
and kernel oops).

This has been built for ppc64, i386, ia64, x86_64, sparc, alpha.
It has been built, booted and tested for cpuset functionality
on an SN2 (ia64).

Dinakar or Nick - could you verify that it for sure does avoid
the problems Hawkes reported.  Hawkes is out of town, and I don't
have the recipe to reproduce what he found.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-26 16:38:47 -07:00
Paul Jackson
ca2f3daf77 [PATCH] undo partial cpu_exclusive sched domain disabling
The partial disabling of Dinakar's new facility to allow
cpu_exclusive cpusets to define dynamic sched domains
doesn't go far enough.  At the suggestion of Nick Piggin
and Dinakar, let us instead totally disable this facility
for 2.6.13, in order to avoid problems first reported
by John Hawkes (corrupt sched data structures and kernel oops).

This patch removes the partial disabling code in 2.6.13-rc7,
in anticipation of the next patch, which will totally disable
it instead.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-26 16:38:46 -07:00
Len Brown
09d4a80e66 Merge HEAD from ../from-linus 2005-08-25 12:45:49 -04:00
Len Brown
eb7b6b3264 [ACPI] IA64-related ACPI Kconfig fixes
Build issues were mostly in the ACPI=n case -- don't do that.
Select ACPI from IA64_GENERIC.
Add some missing dependencies on ACPI.

Mark BLACKLIST_YEAR and some laptop-only ACPI drivers
as X86-only.  Let me know when you get an IA64 Laptop.

Signed-off-by: Len Brown <len.brown@intel.com>
2005-08-25 12:14:20 -04:00
Paul Jackson
3725822f7c [PATCH] cpu_exclusive sched domains build fix
As reported by Paul Mackerras <paulus@samba.org>, the previous patch
"cpu_exclusive sched domains fix" broke the ppc64 build with
CONFIC_CPUSET, yielding error messages:

kernel/cpuset.c: In function 'update_cpu_domains':
kernel/cpuset.c:648: error: invalid lvalue in unary '&'
kernel/cpuset.c:648: error: invalid lvalue in unary '&'

On some arch's, the node_to_cpumask() is a function, returning
a cpumask_t.  But the for_each_cpu_mask() requires an lvalue mask.

The following patch fixes this build failure by making a copy
of the cpumask_t on the stack.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-24 09:40:45 -07:00
Paul Jackson
d10689b68a [PATCH] cpu_exclusive sched domains on partial nodes temp fix
This keeps the kernel/cpuset.c routine update_cpu_domains() from
invoking the sched.c routine partition_sched_domains() if the cpuset in
question doesn't fall on node boundaries.

I have boot tested this on an SN2, and with the help of a couple of ad
hoc printk's, determined that it does indeed avoid calling the
partition_sched_domains() routine on partial nodes.

I did not directly verify that this avoids setting up bogus sched
domains or avoids the oops that Hawkes saw.

This patch imposes a silent artificial constraint on which cpusets can
be used to define dynamic sched domains.

This patch should allow proceeding with this new feature in 2.6.13 for
the configurations in which it is useful (node alligned sched domains)
while avoiding trying to setup sched domains in the less useful cases
that can cause the kernel corruption and oops.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Dinakar Guniguntala <dino@in.ibm.com>
Acked-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-23 20:02:52 -07:00
David Meybohm
4c5640cb5f [PATCH] preempt race in getppid
With CONFIG_PREEMPT && !CONFIG_SMP, it's possible for sys_getppid to
return a bogus value if the parent's task_struct gets reallocated after
current->group_leader->real_parent is read:

        asmlinkage long sys_getppid(void)
        {
                int pid;
                struct task_struct *me = current;
                struct task_struct *parent;

                parent = me->group_leader->real_parent;
RACE HERE =>    for (;;) {
                        pid = parent->tgid;
        #ifdef CONFIG_SMP
        {
                        struct task_struct *old = parent;

                        /*
                         * Make sure we read the pid before re-reading the
                         * parent pointer:
                         */
                        smp_rmb();
                        parent = me->group_leader->real_parent;
                        if (old != parent)
                                continue;
        }
        #endif
                        break;
                }
                return pid;
        }

If the process gets preempted at the indicated point, the parent process
can go ahead and call exit() and then get wait()'d on to reap its
task_struct. When the preempted process gets resumed, it will not do any
further checks of the parent pointer on !CONFIG_SMP: it will read the
bad pid and return.

So, the same algorithm used when SMP is enabled should be used when
preempt is enabled, which will recheck ->real_parent in this case.

Signed-off-by: David Meybohm <dmeybohmlkml@bellsouth.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-23 11:44:29 -07:00
Matt Mackall
024f474795 [PATCH] Make RLIMIT_NICE ranges consistent with getpriority(2)
As suggested by Michael Kerrisk <mtk-manpages@gmx.net>, make RLIMIT_NICE
consistent with getpriority before it becomes available in released glibc.

Signed-off-by: Matt Mackall <mpm@selenic.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-18 12:53:58 -07:00
Bhavesh P. Davda
dd12f48d4e [PATCH] NPTL signal delivery deadlock fix
This bug is quite subtle and only happens in a very interesting
situation where a real-time threaded process is in the middle of a
coredump when someone whacks it with a SIGKILL.  However, this deadlock
leaves the system pretty hosed and you have to reboot to recover.

Not good for real-time priority-preemption applications like our
telephony application, with 90+ real-time (SCHED_FIFO and SCHED_RR)
processes, many of them multi-threaded, interacting with each other for
high volume call processing.

Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-17 12:52:04 -07:00
Amy Griffis
3c789a1905 AUDIT: Prevent duplicate syscall rules
The following patch against audit.81 prevents duplicate syscall rules in
a given filter list by walking the list on each rule add.

I also removed the unused struct audit_entry in audit.c and made the
static inlines in auditsc.c consistent.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-08-17 16:05:35 +01:00
David Woodhouse
c389649594 AUDIT: Speed up audit_filter_syscall() for the non-auditable case.
It was showing up fairly high on profiles even when no rules were set.
Make sure the common path stays as fast as possible.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-08-17 14:49:57 +01:00
David Woodhouse
413a1c7520 AUDIT: Fix task refcount leak in audit_filter_syscall()
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-08-17 14:45:55 +01:00
David Woodhouse
327b6b08d6 Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git 2005-08-17 14:37:55 +01:00
James Bottomley
6068674437 [PATCH] remove name length check in a workqueue
We have a chek in there to make sure that the name won't overflow
task_struct.comm[], but it's triggering for scsi with lots of HBAs, only
scsi is using single-threaded workqueues which don't append the "/%d"
anyway.

All too hard.  Just kill the BUG_ON.

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>

[ kthread_create() uses vsnprintf() and limits the thing, so no
  actual overflow can actually happen regardless ]

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-10 11:55:19 -07:00
Paul Jackson
3077a260e9 [PATCH] cpuset release ABBA deadlock fix
Fix possible cpuset_sem ABBA deadlock if 'notify_on_release' set.

For a particular usage pattern, creating and destroying cpusets fairly
frequently using notify_on_release, on a very large system, this deadlock
can be seen every few days.  If you are not using the cpuset
notify_on_release feature, you will never see this deadlock.

The existing code, on task exit (or cpuset deletion) did:

  get cpuset_sem
  if cpuset marked notify_on_release and is ready to release:
    compute cpuset path relative to /dev/cpuset mount point
    call_usermodehelper() forks /sbin/cpuset_release_agent with path
  drop cpuset_sem

Unfortunately, the fork in call_usermodehelper can allocate memory, and
allocating memory can require cpuset_sem, if the mems_generation values
changed in the interim.  This results in an ABBA deadlock, trying to obtain
cpuset_sem when it is already held by the current task.

To fix this, I put the cpuset path (which must be computed while holding
cpuset_sem) in a temporary buffer, to be used in the call_usermodehelper
call of /sbin/cpuset_release_agent only _after_ dropping cpuset_sem.

So the new logic is:

  get cpuset_sem
  if cpuset marked notify_on_release and is ready to release:
    compute cpuset path relative to /dev/cpuset mount point
    stash path in kmalloc'd buffer
  drop cpuset_sem
  call_usermodehelper() forks /sbin/cpuset_release_agent with path
  free path

The sharp eyed reader might notice that this patch does not contain any
calls to kmalloc.  The existing code in the check_for_release() routine was
already kmalloc'ing a buffer to hold the cpuset path.  In the old code, it
just held the buffer for a few lines, over the cpuset_release_agent() call
that in turn invoked call_usermodehelper().  In the new code, with the
application of this patch, it returns that buffer via the new char
**ppathbuf parameter, for later use and freeing in cpuset_release_agent(),
which is called after cpuset_sem is dropped.  Whereas the old code has just
one call to cpuset_release_agent(), right in the check_for_release()
routine, the new code has three calls to cpuset_release_agent(), from the
various places that a cpuset can be released.

This patch has been build and booted on SN2, and passed a stress test that
previously hit the deadlock within a few seconds.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-09 12:08:22 -07:00
David Woodhouse
c973b112c7 Merge with /shiny/git/linux-2.6/.git 2005-08-09 16:51:35 +01:00
Andrew Morton
c306895167 [PATCH] revert "timer exit cleanup"
Revert this June 17 patch: it broke persistence of timers across execve().

Cc: Roland McGrath <roland@redhat.com>
Cc: george anzinger <george@mvista.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04 16:57:49 -07:00
Benjamin Herrenschmidt
c36f19e02a [PATCH] Remove suspend() calls from shutdown path
This removes the calls to device_suspend() from the shutdown path that
were added sometime during 2.6.13-rc*.  They aren't working properly on
a number of configs (I got reports from both ppc powerbook users and x86
users) causing the system to not shutdown anymore.

I think it isn't the right approach at the moment anyway.  We have
already a shutdown() callback for the drivers that actually care about
shutdown and the suspend() code isn't yet in a good enough shape to be
so much generalized.  Also, the semantics of suspend and shutdown are
slightly different on a number of setups and the way this was patched in
provides little way for drivers to cleanly differenciate.  It should
have been at least a different message.

For 2.6.13, I think we should revert to 2.6.12 behaviour and have a
working suspend back.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04 08:20:47 -07:00
Rusty Russell
842bbaaa73 [PATCH] Module per-cpu alignment cannot always be met
The module code assumes noone will ever ask for a per-cpu area more than
SMP_CACHE_BYTES aligned.  However, as these cases show, gcc asks sometimes
asks for 32-byte alignment for the per-cpu section on a module, and if
CONFIG_X86_L1_CACHE_SHIFT is 4, we hit that BUG_ON().  This is obviously an
unusual combination, as there have been few reports, but better to warn
than die.

See:
	http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/0768.html

And more recently:
	http://bugs.gentoo.org/show_bug.cgi?id=97006

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01 21:38:01 -07:00
Ingo Molnar
6cb54819d7 [PATCH] remove sys_set_zone_reclaim()
This removes sys_set_zone_reclaim() for now.  While i'm sure Martin is
trying to solve a real problem, we must not hard-code an incomplete and
insufficient approach into a syscall, because syscalls are pretty much
for eternity.  I am quite strongly convinced that this syscall must not
hit v2.6.13 in its current form.

Firstly, the syscall lacks basic syscall design: e.g. it allows the
global setting of VM policy for unprivileged users. (!) [ Imagine an
Oracle installation and a SAP installation on the same NUMA box fighting
over the 'optimal' setting for this flag. What will they do? Will they
try to set the flag to their own preferred value every second or so? ]

Secondly, it was added based on a single datapoint from Martin:

 http://marc.theaimsgroup.com/?l=linux-mm&m=111763597218177&w=2

where Martin characterizes the numbers the following way:

 ' Run-to-run variability for "make -j" is huge, so these numbers aren't
   terribly useful except to see that with reclaim the benchmark still
   finishes in a reasonable amount of time. '

in other words: the fundamental problem has likely not been solved, only
a tendential move into the right direction has been observed, and a
handful of numbers were picked out of a set of hugely variable results,
without showing the variability data. How much variance is there
run-to-run?

I'd really suggest to first walk the walk and see what's needed to get
stable & predictable kernel compilation numbers on that NUMA box, before
adding random syscalls to tune a particular aspect of the VM ... which
approach might not even matter once the whole picture has been analyzed
and understood!

The third, most important point is that the syscall exposes VM tuning
internals in a completely unstructured way. What sense does it make to
have a _GLOBAL_ per-node setting for 'should we go to another node for
reclaim'? If then it might make sense to do this per-app, via numalib or
so.

The change is minimalistic in that it doesnt remove the syscall and the
underlying infrastructure changes, only the user-visible changes.  We
could perhaps add a CAP_SYS_ADMIN-only sysctl for this hack, a'ka
/proc/sys/vm/swappiness, but even that looks quite counterproductive
when the generic approach is that we are trying to reduce the number of
external factors in the VM balance picture.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01 10:03:56 -07:00
Andrew Morton
c70f5d6610 [PATCH] revert bogus softirq changes
This snuck in with an x86_64 change.  Thanks to Richard Purdie
<rpurdie@rpsys.net> for spotting it.

Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-30 10:49:59 -07:00
Eric W. Biederman
1108bae41e [PATCH] reboot: remove device_suspend(PMSG_FREEZE) from kernel_kexec
If device_suspend(PMSG_FREEZE) is not ready to be called in
kernel_restart it is definitely not ready to be called in the even more
fickle kernel_kexec.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-29 12:02:09 -07:00
George Anzinger
78fa74a23b [PATCH] posix timers: fix normalization problem
(We found this (after a customer complained) and it is in the kernel.org
kernel.  Seems that for CLOCK_MONOTONIC absolute timers and clock_nanosleep
calls both the request time and wall_to_monotonic are subtracted prior to
the normalize resulting in an overflow in the existing normalize test.
This causes the result to be shifted ~4 seconds ahead instead of ~2 seconds
back in time.)

The normalize code in posix-timers.c fails when the tv_nsec member is ~1.2
seconds negative.  This can happen on absolute timers (and
clock_nanosleeps) requested on CLOCK_MONOTONIC (both the request time and
wall_to_monotonic are subtracted resulting in the possibility of a number
close to -2 seconds.)

This fix uses the set_normalized_timespec() (which does not have an
overflow problem) to fix the problem and as a side effect makes the code
cleaner.

Signed-off-by: George Anzinger <george@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-28 21:46:05 -07:00
Andi Kleen
ed6b676ca8 [PATCH] x86_64: Switch to the interrupt stack when running a softirq in local_bh_enable()
This avoids some potential stack overflows with very deep softirq callchains.
i386 does this too.

TOADD CFI annotation

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-28 21:46:02 -07:00
Andrew Morton
e4ff4d7f9d [PATCH] Avoid device suspend on reboot
My fairly ordinary x86 test box gets stuck during reboot on the
wait_for_completion() in ide_do_drive_cmd():

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:46:37 -07:00
Jesper Juhl
77933d7276 [PATCH] clean up inline static vs static inline
`gcc -W' likes to complain if the static keyword is not at the beginning of
the declaration.  This patch fixes all remaining occurrences of "inline
static" up with "static inline" in the entire kernel tree (140 occurrences in
47 files).

While making this change I came across a few lines with trailing whitespace
that I also fixed up, I have also added or removed a blank line or two here
and there, but there are no functional changes in the patch.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:26:20 -07:00
Randy Dunlap
e77e17161c [PATCH] kernel/crash_dump.c: add kerneldoc
Add kerneldoc to kernel/crash_dump.c

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:26:06 -07:00
Randy Dunlap
d9fd8a6d44 [PATCH] kernel/cpuset.c: add kerneldoc, fix typos
Add kerneldoc to kernel/cpuset.c

Fix cpuset typos in init/Kconfig

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:26:06 -07:00
Randy Dunlap
207a7ba8dc [PATCH] kernel/capability.c: add kerneldoc
Add kerneldoc to kernel/capability.c

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:26:06 -07:00
Martin Schwidefsky
951f22d5b1 [PATCH] s390: spin lock retry
Split spin lock and r/w lock implementation into a single try which is done
inline and an out of line function that repeatedly tries to get the lock
before doing the cpu_relax().  Add a system control to set the number of
retries before a cpu is yielded.

The reason for the spin lock retry is that the diagnose 0x44 that is used to
give up the virtual cpu is quite expensive.  For spin locks that are held only
for a short period of time the costs of the diagnoses outweights the savings
for spin locks that are held for a longer timer.  The default retry count is
1000.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:26:04 -07:00
George Anzinger
d912d1ff21 [PATCH] itimer fixes
Fix the recent off-by-one fix in the itimer code:

1. The repeating timer is figured using the requested time
	(not +1 as we know where we are in the jiffie).

2. The tests for interval too large are left to the time_val to jiffie code.

Signed-off-by: George Anzinger <george@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:25:51 -07:00
Nigel Cunningham
bba0e4670a [PATCH] Address BUG: using smp_processor_id() in preemptible [00000001] code
This patch fixes a warning in the disable_nonboot_cpus call in
kernel/power/smp.c.

Signed-off by: Nigel Cunningham <nigel@suspend2.net>

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27 16:25:50 -07:00
David Woodhouse
c5fbc3966f Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git 2005-07-27 14:14:13 +01:00
Steven Rostedt
d46523ea32 [PATCH] fix MAX_USER_RT_PRIO and MAX_RT_PRIO
Here's the patch again to fix the code to handle if the values between
MAX_USER_RT_PRIO and MAX_RT_PRIO are different.

Without this patch, an SMP system will crash if the values are
different.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26 15:40:00 -07:00
Andreas Steinmetz
18586e7216 [PATCH] Fix RLIMIT_RTPRIO breakage
RLIMIT_RTPRIO is supposed to grant non privileged users the right to use
SCHED_FIFO/SCHED_RR scheduling policies with priorites bounded by the
RLIMIT_RTPRIO value via sched_setscheduler(). This is usually used by
audio users.

Unfortunately this is broken in 2.6.13rc3 as you can see in the excerpt
from sched_setscheduler below:

        /*
         * Allow unprivileged RT tasks to decrease priority:
         */
        if (!capable(CAP_SYS_NICE)) {
                /* can't change policy */
                if (policy != p->policy)
                        return -EPERM;

After the above unconditional test which causes sched_setscheduler to
fail with no regard to the RLIMIT_RTPRIO value the following check is made:

               /* can't increase priority */
                if (policy != SCHED_NORMAL &&
                    param->sched_priority > p->rt_priority &&
                    param->sched_priority >
                                p->signal->rlim[RLIMIT_RTPRIO].rlim_cur)
                        return -EPERM;

Thus I do believe that the RLIMIT_RTPRIO value must be taken into
account for the policy check, especially as the RLIMIT_RTPRIO limit is
of no use without this change.

The attached patch fixes this problem.

Signed-off-by: Andreas Steinmetz <ast@domdv.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26 15:30:51 -07:00
Eric W. Biederman
fdde86ac50 [PATCH] swpsuspend: Have suspend to disk use factors of sys_reboot
The suspend to disk code was a poor copy of the code in
sys_reboot now that we have kernel_power_off, kernel_restart
and kernel_halt use them instead of poorly duplicating them inline.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26 14:35:44 -07:00
Eric W. Biederman
2f048ea81d [PATCH] Call emergency_reboot from panic
We know the system is in trouble so there is no question if this
is an emergecy :)

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26 14:35:43 -07:00
Eric W. Biederman
ff31977782 [PATCH] Use kernel_power_off in sysrq-o
We already do all of the gymnastics to run from process context
to call the power off code so call into the power off code cleanly.

This especially helps acpi as part of it's shutdown logic should
run acpi_shutdown called from device_shutdown which was not
being called from here.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26 14:35:43 -07:00
Eric W. Biederman
7c9034735e [PATCH] Add emergency_restart()
When the kernel is working well and we want to restart cleanly
kernel_restart is the function to use.   But in many instances
the kernel wants to reboot when thing are expected to be working
very badly such as from panic or a software watchdog handler.

This patch adds the function emergency_restart() so that
callers can be clear what semantics they expect when calling
restart.  emergency_restart() is expected to be callable
from interrupt context and possibly reliable in even more
trying circumstances.

This is an initial generic implementation for all architectures.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26 14:35:41 -07:00
Eric W. Biederman
abcd9e51f5 [PATCH] Make ctrl_alt_del call kernel_restart to get a proper reboot.
It is obvious we wanted to call kernel_restart here
but since we don't have it the code was expanded inline and hasn't
been correct since sometime in 2.4.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26 14:35:41 -07:00
Eric W. Biederman
4a00ea1e18 [PATCH] Refactor sys_reboot into reusable parts
Because the factors of sys_reboot don't exist people calling
into the reboot path duplicate the code badly, leading to
inconsistent expectations of code in the reboot path.

This patch should is just code motion.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26 14:35:41 -07:00
Eric W. Biederman
47f61f397c [PATCH] Add missing device_suspsend(PMSG_FREEZE) calls.
In the recent addition of device_suspend calls into
sys_reboot two code paths were missed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26 14:35:41 -07:00
David Woodhouse
39299d9d15 Merge with /shiny/git/linux-2.6/.git 2005-07-19 17:49:39 -04:00
David Woodhouse
ce625a8016 AUDIT: Reduce contention in audit_serial()
... by generating serial numbers only if an audit context is actually
_used_, rather than doing so at syscall entry even when the context
isn't necessarily marked auditable.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-18 14:24:46 -04:00
David Woodhouse
d5b454f2c4 AUDIT: Fix livelock in audit_serial().
The tricks with atomic_t were bizarre. Just do it sensibly instead.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-15 12:56:03 +01:00
David Woodhouse
351bb72259 AUDIT: Fix compile error in audit_filter_syscall
We didn't rename it to audit_tgid after all. Except once... Doh.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-14 14:40:06 +01:00
David Woodhouse
f55619642e AUDIT: Avoid scheduling in idle thread
When we flush a pending syscall audit record due to audit_free(), we
might be doing that in the context of the idle thread. So use GFP_ATOMIC

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-13 22:47:07 +01:00
David Woodhouse
582edda586 AUDIT: Exempt the whole auditd thread-group from auditing
and not just the one thread.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-13 22:39:34 +01:00
Victor Fusco
6c8c8ba5d7 [AUDIT] Fix sparse warning about gfp_mask type
Fix the sparse warning "implicit cast to nocast type"

Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-13 22:26:57 +01:00
Robert Love
0399cb08c5 [PATCH] inotify: move sysctl
This moves the inotify sysctl knobs to "/proc/sys/fs/inotify" from
"/proc/sys/fs".  Also some related cleanup.

Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-13 11:09:31 -07:00
David Woodhouse
30beab1491 Merge with /shiny/git/linux-2.6/.git 2005-07-13 15:25:59 +01:00
Robert Love
0eeca28300 [PATCH] inotify
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:

        * dnotify requires the opening of one fd per each directory
          that you intend to watch. This quickly results in too many
          open files and pins removable media, preventing unmount.
        * dnotify is directory-based. You only learn about changes to
          directories. Sure, a change to a file in a directory affects
          the directory, but you are then forced to keep a cache of
          stat structures.
        * dnotify's interface to user-space is awful.  Signals?

inotify provides a more usable, simple, powerful solution to file change
notification:

        * inotify's interface is a system call that returns a fd, not SIGIO.
	  You get a single fd, which is select()-able.
        * inotify has an event that says "the filesystem that the item
          you were watching is on was unmounted."
        * inotify can watch directories or files.

Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.

See Documentation/filesystems/inotify.txt.

Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12 20:38:38 -07:00
Linus Torvalds
3f603ed319 Merge master.kernel.org:/pub/scm/linux/kernel/git/lenb/linux-2.6 2005-07-12 16:04:50 -07:00
Hugh Dickins
3b6bfcdb11 [PATCH] lower VM_DONTCOPY total_vm
dup_mmap of a VM_DONTCOPY vma forgot to lower the child's total_vm.  (But
no way does this account for the recent report of total_vm seen too low.)

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12 16:00:58 -07:00
Andrew Morton
d53d9f16ea [PATCH] name_to_dev_t warning fix
kernel/power/disk.c needs a declaration of name_to_dev_t() in scope.  mount.h
seems like an appropriate choice.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12 16:00:58 -07:00
Len Brown
5028770a42 [ACPI] merge acpi-2.6.12 branch into latest Linux 2.6.13-rc...
Signed-off-by: Len Brown <len.brown@intel.com>
2005-07-12 17:21:56 -04:00
David Shaohua Li
5ae947ecc9 [ACPI] Suspend to RAM fix
Free some RAM before entering S3 so that upon
resume we can be sure early allocations will succeed.

http://bugzilla.kernel.org/show_bug.cgi?id=3469

Signed-off-by: David Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2005-07-11 23:21:54 -04:00
Alexey Starikovskiy
e2a5b420f7 [ACPI] ACPI poweroff fix
Register an "acpi" system device to be notified of shutdown preparation.
This depends on CONFIG_PM

http://bugzilla.kernel.org/show_bug.cgi?id=4041

Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
2005-07-11 23:20:49 -04:00
Ingo Molnar
5bbcfd9000 [PATCH] cond_resched(): fix bogus might_sleep() warning
The BKS might be reacquired before we have dropped PREEMPT_ACTIVE, which
could trigger a second could trigger a second cond_resched() call.  Bug
found by Hirofumi Ogawa.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:23:47 -07:00
Christoph Lameter
6c036527a6 [PATCH] mostly_read data section
Add a new section called ".data.read_mostly" for data items that are read
frequently and rarely written to like cpumaps etc.

If these maps are placed in the .data section then these frequenly read
items may end up in cachelines with data is is frequently updated.  In that
case all processors in an SMP system must needlessly reload the cachelines
again and again containing elements of those frequently used variables.

The ability to share these cachelines will allow each cpu in an SMP system
to keep local copies of those shared cachelines thereby optimizing
performance.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Christoph Lameter <christoph@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:23:46 -07:00
Pavel Machek
1322ad4151 [PATCH] pm: clean up process.c
freezeable() already tests for TRACED/STOPPED processes, no need to do it
twice.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:23:43 -07:00
Pavel Machek
47b724f3fe [PATCH] swsusp: fix error handling
Fix error handling and whitespace in swsusp.c.  swsusp_free() was called when
there was nothing allocating, leading to oops.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:23:43 -07:00
Pavel Machek
3efa147ad7 [PATCH] pm: Fix resume from initrd
Move device name resolution code around so that it is not called from
resume-from-initrd.  name_to_dev_t may be unavailable at that point.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07 18:23:43 -07:00
Rusty Lynch
6772926bef [PATCH] kprobes: fix namespace problem and sparc64 build
The following renames arch_init, a kprobes function for performing any
architecture specific initialization, to arch_init_kprobes in order to
cleanup the namespace.

Also, this patch adds arch_init_kprobes to sparc64 to fix the sparc64 kprobes
build from the last return probe patch.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-05 19:19:00 -07:00
David Woodhouse
21af6c4f2a AUDIT: Really don't audit auditd.
The pid in the audit context isn't always set up. Use tsk->pid when 
checking whether it's auditd in audit_filter_syscall(), instead of 
ctx->pid. Remove a band-aid which did the same elsewhere.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-02 14:10:46 +01:00
David Woodhouse
ac4cec443a AUDIT: Stop waiting for backlog after audit_panic() happens
We force a rate-limit on auditable events by making them wait for space 
on the backlog queue. However, if auditd really is AWOL then this could 
potentially bring the entire system to a halt, depending on the audit 
rules in effect.

Firstly, make sure the wait time is honoured correctly -- it's the 
maximum time the process should wait, rather than the time to wait 
_each_ time round the loop. We were getting re-woken _each_ time a 
packet was dequeued, and the timeout was being restarted each time.

Secondly, reset the wait time after audit_panic() is called. In general 
this will be reset to zero, to allow progress to be made. If the system
is configured to _actually_ panic on audit_panic() then that will 
already have happened; otherwise we know that audit records are being 
lost anyway. 

These two tunables can't be exposed via AUDIT_GET and AUDIT_SET because 
those aren't particularly well-designed. It probably should have been 
done by sysctls or sysfs anyway -- one for a later patch.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-07-02 14:08:48 +01:00
David Woodhouse
d2f6409584 Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git 2005-07-02 13:39:09 +01:00
Alan Cox
200803dfe4 [PATCH] irqpoll
Anyone reporting a stuck IRQ should try these options.  Its effectiveness
varies we've found in the Fedora case.  Quite a few systems with misdescribed
IRQ routing just work when you use irqpoll.  It also fixes up the VIA systems
although thats now fixed with the VIA quirk (which we could just make default
as its what Redmond OS does but Linus didn't like it historically).

A small number of systems have jammed IRQ sources or misdescribes that cause
an IRQ that we have no handler registered anywhere for.  In those cases it
doesn't help.

Signed-off-by: Alan Cox <number6@the-village.bc.nu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28 21:20:35 -07:00
Oleg Nesterov
f01b1b0baa [PATCH] ITIMER_REAL: fix possible deadlock and race
As Steven Rostedt pointed out, there are 2 problems with ITIMER_REAL
timers.

1. do_setitimer() does not call del_timer_sync() in case
   when the timer is not pending (it_real_value() returns 0).
   This is wrong, the timer may still be running, and it can
   rearm itself.

2. It calls del_timer_sync() with tsk->sighand->siglock held.
   This is deadlockable, because timer's handler needs this
   lock too.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28 21:20:30 -07:00
Luca Falavigna
47f176fdaf [PATCH] Using msleep() instead of HZ
Use msleep() in a few places.

Signed-off-by: Luca Falavigna <dktrkranz@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28 21:20:29 -07:00
Ingo Molnar
f340c0d1a3 [PATCH] Tweak idle thread setup semantics
This patch tweaks idle thread setup semantics a bit: instead of setting
NEED_RESCHED in init_idle(), we do an explicit schedule() before calling
into cpu_idle().

This patch, while having no negative side-effects, enables wider use of
cond_resched()s.  (which might happen in the stock kernel too, but it's
particulary important for voluntary-preempt)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28 14:56:51 -07:00
Alexey Dobriyan
314b6a4d80 [PATCH] kexec: fix sparse warnings
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28 14:53:40 -07:00
Rusty Lynch
802eae7c80 [PATCH] Return probe redesign: architecture independent changes
The following is the second version of the function return probe patches
I sent out earlier this week.  Changes since my last submission include:

* Fix in ppc64 code removing an unneeded call to re-enable preemption
* Fix a build problem in ia64 when kprobes was turned off
* Added another BUG_ON check to each of the architecture trampoline
  handlers

My initial patch description ==>

 From my experiences with adding return probes to x86_64 and ia64, and the
feedback on LKML to those patches, I think we can simplify the design
for return probes.

The following patch tweaks the original design such that:

* Instead of storing the stack address in the return probe instance, the
  task pointer is stored.  This gives us all we need in order to:
    - find the correct return probe instance when we enter the trampoline
      (even if we are recursing)
    - find all left-over return probe instances when the task is going away

  This has the side effect of simplifying the implementation since more
  work can be done in kernel/kprobes.c since architecture specific knowledge
  of the stack layout is no longer required.  Specifically, we no longer have:
	- arch_get_kprobe_task()
	- arch_kprobe_flush_task()
	- get_rp_inst_tsk()
	- get_rp_inst()
	- trampoline_post_handler() <see next bullet>

* Instead of splitting the return probe handling and cleanup logic across
  the pre and post trampoline handlers, all the work is pushed into the
  pre function (trampoline_probe_handler), and then we skip single stepping
  the original function.  In this case the original instruction to be single
  stepped was just a NOP, and we can do without the extra interruption.

The new flow of events to having a return probe handler execute when a target
function exits is:

* At system initialization time, a kprobe is inserted at the beginning of
  kretprobe_trampoline.  kernel/kprobes.c use to handle this on it's own,
  but ia64 needed to do this a little differently (i.e. a function pointer
  is really a pointer to a structure containing the instruction pointer and
  a global pointer), so I added the notion of arch_init(), so that
  kernel/kprobes.c:init_kprobes() now allows architecture specific
  initialization by calling arch_init() before exiting.  Each architecture
  now registers a kprobe on it's own trampoline function.

* register_kretprobe() will insert a kprobe at the beginning of the targeted
  function with the kprobe pre_handler set to arch_prepare_kretprobe
  (still no change)

* When the target function is entered, the kprobe is fired, calling
  arch_prepare_kretprobe (still no change)

* In arch_prepare_kretprobe() we try to get a free instance and if one is
  available then we fill out the instance with a pointer to the return probe,
  the original return address, and a pointer to the task structure (instead
  of the stack address.)  Just like before we change the return address
  to the trampoline function and mark the instance as used.

  If multiple return probes are registered for a given target function,
  then arch_prepare_kretprobe() will get called multiple times for the same
  task (since our kprobe implementation is able to handle multiple kprobes
  at the same address.)  Past the first call to arch_prepare_kretprobe,
  we end up with the original address stored in the return probe instance
  pointing to our trampoline function. (This is a significant difference
  from the original arch_prepare_kretprobe design.)

* Target function executes like normal and then returns to kretprobe_trampoline.

* kprobe inserted on the first instruction of kretprobe_trampoline is fired
  and calls trampoline_probe_handler() (no change here)

* trampoline_probe_handler() consumes each of the instances associated with
  the current task by calling the registered handler function and marking
  the instance as unused until an instance is found that has a return address
  different then the trampoline function.

  (change similar to my previous ia64 RFC)

* If the task is killed with some left-over return probe instances (meaning
  that a target function was entered, but never returned), then we just
  free any instances associated with the task.  (Not much different other
  then we can handle this without calling architecture specific functions.)

  There is a known problem that this patch does not yet solve where
  registering a return probe flush_old_exec or flush_thread will put us
  in a bad state.  Most likely the best way to handle this is to not allow
  registering return probes on these two functions.

  (Significant change)

This patch series applies to the 2.6.12-rc6-mm1 kernel, and provides:
  * kernel/kprobes.c changes
  * i386 patch of existing return probes implementation
  * x86_64 patch of existing return probe implementation
  * ia64 implementation
  * ppc64 implementation (provided by Ananth)

This patch implements the architecture independant changes for a reworking
of the kprobes based function return probes design. Changes include:

  * Removing functions for querying a return probe instance off a stack address
  * Removing the stack_addr field from the kretprobe_instance definition,
    and adding a task pointer
  * Adding architecture specific initialization via arch_init()
  * Removing extern definitions for the architecture trampoline functions
    (this isn't needed anymore since the architecture handles the
     initialization of the kprobe in the return probe trampoline function.)

Signed-off-by: Rusty Lynch <rusty.lynch@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27 15:23:52 -07:00
Ananth N Mavinakayanahalli
9ec4b1f356 [PATCH] kprobes: fix single-step out of line - take2
Now that PPC64 has no-execute support, here is a second try to fix the
single step out of line during kprobe execution.  Kprobes on x86_64 already
solved this problem by allocating an executable page and using it as the
scratch area for stepping out of line.  Reuse that.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27 15:23:52 -07:00
Jens Axboe
22e2c507c3 [PATCH] Update cfq io scheduler to time sliced design
This updates the CFQ io scheduler to the new time sliced design (cfq
v3).  It provides full process fairness, while giving excellent
aggregate system throughput even for many competing processes.  It
supports io priorities, either inherited from the cpu nice value or set
directly with the ioprio_get/set syscalls.  The latter closely mimic
set/getpriority.

This import is based on my latest from -mm.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27 14:33:29 -07:00
Linus Torvalds
2031d0f586 Merge Christoph's freeze cleanup patch 2005-06-25 17:16:53 -07:00
Christoph Lameter
3e1d1d28d9 [PATCH] Cleanup patch for process freezing
1. Establish a simple API for process freezing defined in linux/include/sched.h:

   frozen(process)		Check for frozen process
   freezing(process)		Check if a process is being frozen
   freeze(process)		Tell a process to freeze (go to refrigerator)
   thaw_process(process)	Restart process
   frozen_process(process)	Process is frozen now

2. Remove all references to PF_FREEZE and PF_FROZEN from all
   kernel sources except sched.h

3. Fix numerous locations where try_to_freeze is manually done by a driver

4. Remove the argument that is no longer necessary from two function calls.

5. Some whitespace cleanup

6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
   cleared before setting PF_FROZEN, recalc_sigpending does not check
   PF_FROZEN).

This patch does not address the problem of freeze_processes() violating the rule
that a task may only modify its own flags by setting PF_FREEZE. This is not clean
in an SMP environment. freeze(process) is therefore not SMP safe!

Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 17:10:13 -07:00
Nick Wilson
8c0e33c133 [PATCH] Use ALIGN to remove duplicate code
This patch makes use of ALIGN() to remove duplicate round-up code.

Signed-off-by: Nick Wilson <njw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:25:02 -07:00
Jesper Juhl
5a6b454f80 [PATCH] remove redundant NULL check before before kfree() in kernel/sysctl.c
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:59 -07:00
Domen Puncer
96ec3efdcb [PATCH] kernel/timer: fix msleep_interruptible() comment
The comment for msleep_interruptible() is wrong, as it will ignore
wait-queue events, but will wake up early for signals.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:58 -07:00
Maneesh Soni
72414d3f1d [PATCH] kexec code cleanup
o Following patch provides purely cosmetic changes and corrects CodingStyle
  guide lines related certain issues like below in kexec related files

  o braces for one line "if" statements, "for" loops,
  o more than 80 column wide lines,
  o No space after "while", "for" and "switch" key words

o Changes:
  o take-2: Removed the extra tab before "case" key words.
  o take-3: Put operator at the end of line and space before "*/"

Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:55 -07:00
Alexander Nyberg
6e274d1443 [PATCH] kdump: Use real pt_regs from exception
Makes kexec_crashdump() take a pt_regs * as an argument.  This allows to
get exact register state at the point of the crash.  If we come from direct
panic assertion NULL will be passed and the current registers saved before
crashdump.

This hooks into two places:
die(): check the conditions under which we will panic when calling
do_exit and go there directly with the pt_regs that caused the fatal
fault.

die_nmi(): If we receive an NMI lockup while in the kernel use the
pt_regs and go directly to crash_kexec(). We're probably nested up badly
at this point so this might be the only chance to escape with proper
information.

Signed-off-by: Alexander Nyberg <alexn@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:54 -07:00
Vivek Goyal
666bfddbe8 [PATCH] kdump: Access dump file in elf format (/proc/vmcore)
From: "Vivek Goyal" <vgoyal@in.ibm.com>

o Support for /proc/vmcore interface. This interface exports elf core image
  either in ELF32 or ELF64 format, depending on the format in which elf headers
  have been stored by crashed kernel.
o Added support for CONFIG_VMCORE config option.
o Removed the dependency on /proc/kcore.

From: "Eric W. Biederman" <ebiederm@xmission.com>

This patch has been refactored to more closely match the prevailing style in
the affected files.  And to clearly indicate the dependency between
/proc/kcore and proc/vmcore.c

From: Hariprasad Nellitheertha <hari@in.ibm.com>

This patch contains the code that provides an ELF format interface to the
previous kernel's memory post kexec reboot.

Signed off by Hariprasad Nellitheertha <hari@in.ibm.com>
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:53 -07:00
Vivek Goyal
2030eae52b [PATCH] Retrieve elfcorehdr address from command line
This patch adds support for retrieving the address of elf core header if one
is passed in command line.

Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:53 -07:00
Vivek Goyal
60e64d46a5 [PATCH] kdump: Routines for copying dump pages
This patch provides the interfaces necessary to read the dump contents,
treating it as a high memory device.

Signed off by Hariprasad Nellitheertha <hari@in.ibm.com>
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:53 -07:00
Vivek Goyal
625f1c8219 [PATCH] Kdump: Export crash notes section address through sysfs
o Following patch exports kexec global variable "crash_notes" to user space
  through sysfs as kernel attribute in /sys/kernel.

Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:51 -07:00
Vivek Goyal
50cccc699e [PATCH] Kexec on panic vmlinux initrd fix
This is a minor bug fix in kexec to resolve the problem of loading panic
kernel with initrd.

o Problem: Loading a capture kenrel fails if initrd is also being loaded.
  This has been observed for vmlinux image for kexec on panic case.

o This patch fixes the problem. In segment location and size verification
  logic, minor correction has been done. Segment memory end (mend) should be
  mstart + memsz - 1. This one byte offset was source of failure for initrd
  loading which was being loaded at hole boundary.

Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:48 -07:00
Eric W. Biederman
dc009d9243 [PATCH] kexec: add kexec syscalls
This patch introduces the architecture independent implementation the
sys_kexec_load, the compat_sys_kexec_load system calls.

Kexec on panic support has been integrated into the core patch and is
relatively clean.

In addition the hopefully architecture independent option
crashkernel=size@location has been docuemented.  It's purpose is to reserve
space for the panic kernel to live, and where no DMA transfer will ever be
setup to access.

Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Alexander Nyberg <alexn@telia.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:48 -07:00
Ingo Molnar
f8cbd99bd3 [PATCH] sched: voluntary kernel preemption
This patch adds a new preemption model: 'Voluntary Kernel Preemption'.  The
3 models can be selected from a new menu:

            (X) No Forced Preemption (Server)
            ( ) Voluntary Kernel Preemption (Desktop)
            ( ) Preemptible Kernel (Low-Latency Desktop)

we still default to the stock (Server) preemption model.

Voluntary preemption works by adding a cond_resched()
(reschedule-if-needed) call to every might_sleep() check.  It is lighter
than CONFIG_PREEMPT - at the cost of not having as tight latencies.  It
represents a different latency/complexity/overhead tradeoff.

It has no runtime impact at all if disabled.  Here are size stats that show
how the various preemption models impact the kernel's size:

    text    data     bss     dec     hex filename
 3618774  547184  179896 4345854  424ffe vmlinux.stock
 3626406  547184  179896 4353486  426dce vmlinux.voluntary   +0.2%
 3748414  548640  179896 4476950  445016 vmlinux.preempt     +3.5%

voluntary-preempt is +0.2% of .text, preempt is +3.5%.

This feature has been tested for many months by lots of people (and it's
also included in the RHEL4 distribution and earlier variants were in Fedora
as well), and it's intended for users and distributions who dont want to
use full-blown CONFIG_PREEMPT for one reason or another.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:45 -07:00
Ingo Molnar
f704f56af9 [PATCH] enable PREEMPT_BKL on !PREEMPT+SMP too
The only sane way to clean up the current 3 lock_kernel() variants seems to
be to remove the spinlock-based BKL implementations altogether, and to keep
the semaphore-based one only.  If we dont want to do that for whatever
reason then i'm afraid we have to live with the current complexity.  (but
i'm open for other cleanup suggestions as well.)

To explore this possibility we'll (at a minimum) have to know whether the
semaphore-based BKL works fine on plain SMP too.  The patch below enables
this.

The patch may make sense in isolation as well, as it might bring
performance benefits: code that would formerly spin on the BKL spinlock
will now schedule away and give up the CPU.  It might introduce performance
regressions as well, if any performance-critical code uses the BKL heavily
and gets overscheduled due to the semaphore.  I very much hope there is no
such performance-critical codepath left though.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:45 -07:00
Ingo Molnar
cc19ca86a0 [PATCH] consolidate PREEMPT options into kernel/Kconfig.preempt
This patch consolidates the CONFIG_PREEMPT and CONFIG_PREEMPT_BKL
preemption options into kernel/Kconfig.preempt.  This, besides reducing
source-code, also enables more centralized tweaking of preemption related
options.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:45 -07:00
Dinakar Guniguntala
85d7b94981 [PATCH] Dynamic sched domains: cpuset changes
Adds the core update_cpu_domains code and updated cpusets documentation

Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:45 -07:00
Dinakar Guniguntala
1a20ff27ef [PATCH] Dynamic sched domains: sched changes
The following patches add dynamic sched domains functionality that was
extensively discussed on lkml and lse-tech.  I would like to see this added to
-mm

o The main advantage with this feature is that it ensures that the scheduler
  load balacing code only balances against the cpus that are in the sched
  domain as defined by an exclusive cpuset and not all of the cpus in the
  system. This removes any overhead due to load balancing code trying to
  pull tasks outside of the cpu exclusive cpuset only to be prevented by
  the tasks' cpus_allowed mask.
o cpu exclusive cpusets are useful for servers running orthogonal
  workloads such as RT applications requiring low latency and HPC
  applications that are throughput sensitive

o It provides a new API partition_sched_domains in sched.c
  that makes dynamic sched domains possible.
o cpu_exclusive cpusets sets are now associated with a sched domain.
  Which means that the users can dynamically modify the sched domains
  through the cpuset file system interface
o ia64 sched domain code has been updated to support this feature as well
o Currently, this does not support hotplug. (However some of my tests
  indicate hotplug+preempt is currently broken)
o I have tested it extensively on x86.
o This should have very minimal impact on performance as none of
  the fast paths are affected

Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Matthew Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:45 -07:00
Olivier Croquette
37e4ab3f0c [PATCH] Changing RT priority without CAP_SYS_NICE
Presently, a process without the capability CAP_SYS_NICE can not change
its own policy, which is OK.

But it can also not decrease its RT priority (if scheduled with policy
SCHED_RR or SCHED_FIFO), which is what this patch changes.

The rationale is the same as for the nice value: a process should be
able to require less priority for itself. Increasing the priority is
still not allowed.

This is for example useful if you give a multithreaded user process a RT
priority, and the process would like to organize its internal threads
using priorities also. Then you can give the process the highest
priority needed N, and the process starts its threads with lower
priorities: N-1, N-2...

The POSIX norm says that the permissions are implementation specific, so
I think we can do that.

In a sense, it makes the permissions consistent whatever the policy is:
with this patch, process scheduled by SCHED_FIFO, SCHED_RR and
SCHED_OTHER can all decrease their priority.

From: Ingo Molnar <mingo@elte.hu>

cleaned up and merged to -mm.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:44 -07:00
Chen Shang
a3464a102a [PATCH] sched: micro-optimize task requeueing in schedule()
micro-optimize task requeueing in schedule() & clean up recalc_task_prio().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:44 -07:00
Nick Piggin
77391d7168 [PATCH] sched: relax pinned balancing
The maximum rebalance interval allowed by the multiprocessor balancing
backoff is often not large enough to handle corner cases where there are
lots of tasks pinned on a CPU.  Suresh reported:

	I see system livelock's if for example I have 7000 processes
	pinned onto one cpu (this is on the fastest 8-way system I
	have access to).

After this patch, the machine is reported to go well above this number.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:44 -07:00
Nick Piggin
476d139c21 [PATCH] sched: consolidate sbe sbf
Consolidate balance-on-exec with balance-on-fork.  This is made easy by the
sched-domains RCU patches.

As well as the general goodness of code reduction, this allows the runqueues
to be unlocked during balance-on-fork.

schedstats is a problem.  Maybe just have balance-on-event instead of
distinguishing fork and exec?

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:44 -07:00
Nick Piggin
674311d5b4 [PATCH] sched: RCU domains
One of the problems with the multilevel balance-on-fork/exec is that it needs
to jump through hoops to satisfy sched-domain's locking semantics (that is,
you may traverse your own domain when not preemptable, and you may traverse
others' domains when holding their runqueue lock).

balance-on-exec had to potentially migrate between more than one CPU before
finding a final CPU to migrate to, and balance-on-fork needed to potentially
take multiple runqueue locks.

So bite the bullet and make sched-domains go completely RCU.  This actually
simplifies the code quite a bit.

From: Ingo Molnar <mingo@elte.hu>

schedstats RCU fix, and a nice comment on for_each_domain, from Ingo.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:44 -07:00
Nick Piggin
3dbd534207 [PATCH] sched: multilevel sbe sbf
The fundamental problem that Suresh has with balance on exec and fork is that
it only tries to balance the top level domain with the flag set.

This was worked around by removing degenerate domains, but is still a problem
if people want to start using more complex sched-domains, especially
multilevel NUMA that ia64 is already using.

This patch makes balance on fork and exec try balancing over not just the top
most domain with the flag set, but all the way down the domain tree.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:43 -07:00
Suresh Siddha
245af2c787 [PATCH] sched: remove degenerate domains
Remove degenerate scheduler domains during the sched-domain init.

For example on x86_64, we always have NUMA configured in.  On Intel EM64T
systems, top most sched domain will be of NUMA and with only one sched_group
in it.

With fork/exec balances(recent Nick's fixes in -mm tree), we always endup
taking wrong decisions because of this topmost domain (as it contains only one
group and find_idlest_group always returns NULL).  We will endup loading HT
package completely first, letting active load balance kickin and correct it.

In general, this patch also makes sense with out recent Nick's fixes in -mm.

From: Nick Piggin <nickpiggin@yahoo.com.au>

Modified to account for more than just sched_groups when scanning for
degenerate domains by Nick Piggin.  And allow a runqueue's sd to go NULL
rather than keep a single degenerate domain around (this happens when you run
with maxcpus=1).

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:43 -07:00
Nick Piggin
41c7ce9ad9 [PATCH] sched: null domains
Fix the last 2 places that directly access a runqueue's sched-domain and
assume it cannot be NULL.

That allows the use of NULL for domain, instead of a dummy domain, to signify
no balancing is to happen.  No functional changes.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:43 -07:00
Nick Piggin
4866cde064 [PATCH] sched: cleanup context switch locking
Instead of requiring architecture code to interact with the scheduler's
locking implementation, provide a couple of defines that can be used by the
architecture to request runqueue unlocked context switches, and ask for
interrupts to be enabled over the context switch.

Also replaces the "switch_lock" used by these architectures with an oncpu
flag (note, not a potentially slow bitflag).  This eliminates one bus
locked memory operation when context switching, and simplifies the
task_running function.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:43 -07:00
Ingo Molnar
48c08d3f8f [PATCH] sched: uninline task_timeslice
"Chen, Kenneth W" <kenneth.w.chen@intel.com>

uninline task_timeslice() - reduces code footprint noticeably, and it's
slowpath code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:43 -07:00
Nick Piggin
68767a0ae4 [PATCH] sched: schedstats update for balance on fork
Add SCHEDSTAT statistics for sched-balance-fork.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:42 -07:00
Nick Piggin
147cbb4bbe [PATCH] sched: balance on fork
Reimplement the balance on exec balancing to be sched-domains aware.  Use this
to also do balance on fork balancing.  Make x86_64 do balance on fork over the
NUMA domain.

The problem that the non sched domains aware blancing became apparent on dual
core, multi socket opterons.  What we want is for the new tasks to be sent to
a different socket, but more often than not, we would first load up our
sibling core, or fill two cores of a single remote socket before selecting a
new one.

This gives large improvements to STREAM on such systems.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:42 -07:00
Nick Piggin
cafb20c1f9 [PATCH] sched: no aggressive idle balancing
Remove the very aggressive idle stuff that has recently gone into 2.6 - it is
going against the direction we are trying to go.  Hopefully we can regain
performance through other methods.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:42 -07:00
Nick Piggin
a3f21bce1f [PATCH] sched: tweak affine wakeups
Do less affine wakeups.  We're trying to reduce dbt2-pgsql idle time
regressions here...  make sure we don't don't move tasks the wrong way in an
imbalance condition.  Also, remove the cache coldness requirement from the
calculation - this seems to induce sharp cutoff points where behaviour will
suddenly change on some workloads if the load creeps slightly over or under
some point.  It is good for periodic balancing because in that case have
otherwise have no other context to determine what task to move.

But also make a minor tweak to "wake balancing" - the imbalance tolerance is
now set at half the domain's imbalance, so we get the opportunity to do wake
balancing before the more random periodic rebalancing gets preformed.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:41 -07:00
Nick Piggin
7897986bad [PATCH] sched: balance timers
Do CPU load averaging over a number of different intervals.  Allow each
interval to be chosen by sending a parameter to source_load and target_load.
0 is instantaneous, idx > 0 returns a decaying average with the most recent
sample weighted at 2^(idx-1).  To a maximum of 3 (could be easily increased).

So generally a higher number will result in more conservative balancing.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:41 -07:00
Nick Piggin
99b61ccf0b [PATCH] sched: less aggressive idle balancing
Remove the special casing for idle CPU balancing.  Things like this are
hurting for example on SMT, where are single sibling being idle doesn't really
warrant a really aggressive pull over the NUMA domain, for example.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:41 -07:00
Nick Piggin
db935dbd43 [PATCH] sched: add debugging
These conditions should now be impossible, and we need to fix them if they
happen.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:41 -07:00
Nick Piggin
3950745131 [PATCH] sched: fix SMT scheduling problems
SMT balancing has a couple of problems.  Firstly, active_load_balance is too
complex - basically it should be a dumb helper for when the periodic balancer
has determined there is an imbalance, but gets stuck because the task is
running.

So rip out all its "smarts", and just make it move one task to the target CPU.

Second, the busy CPU's sched-domain tree was being used for active balancing.
This means that it may not see that nr_balance_failed has reached a critical
level.  So use the target CPU's sched-domain tree for this.  We can do this
because we hold its runqueue lock.

Lastly, reset nr_balance_failed to a point where we allow cache hot migration.
This will help ensure active load balancing is successful.

Thanks to Suresh Siddha for pointing out these issues.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:41 -07:00
Nick Piggin
16cfb1c04c [PATCH] sched: reduce active load balancing
Fix up active load balancing a bit so it doesn't get called when it shouldn't.
Reset the nr_balance_failed counter at more points where we have found
conditions to be balanced.  This reduces too aggressive active balancing seen
on some workloads.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:40 -07:00
Nick Piggin
8102679447 [PATCH] sched: improve load balancing pinned tasks
John Hawkes explained the problem best:

	A large number of processes that are pinned to a single CPU results
	in every other CPU's load_balance() seeing this overloaded CPU as
	"busiest", yet move_tasks() never finds a task to pull-migrate.  This
	condition occurs during module unload, but can also occur as a
	denial-of-service using sys_sched_setaffinity().  Several hundred
	CPUs performing this fruitless load_balance() will livelock on the
	busiest CPU's runqueue lock.  A smaller number of CPUs will livelock
	if the pinned task count gets high.

Expanding slightly on John's patch, this one attempts to work out whether the
balancing failure has been due to too many tasks pinned on the runqueue.  This
allows it to be basically invisible to the regular blancing paths (ie.  when
there are no pinned tasks).  We can use this extra knowledge to shut down the
balancing faster, and ensure the migration threads don't start running which
is another problem observed in the wild.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:40 -07:00
Nick Piggin
e0f364f406 [PATCH] sched: cleanup wake_idle
New sched-domains code means we don't get spans with offline CPUs in
them.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:40 -07:00
Pavel Machek
19c324397a [PATCH] swsusp: only allow it when it makes sense
Show swsuspend only on .config where it can compile.  I got this on PPC32 &&
SMP:

kernel/power/smp.c:24: error: storage size of `ctxt' isn't known

Also mark swsusp as no longer experimental.

Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:34 -07:00
Shaohua Li
ac25575203 [PATCH] CPU hotplug printk fix
In the cpu hotplug case, per-cpu data possibly isn't initialized even the
system state is 'running'.  As the comments say in the original code, some
console drivers assume per-cpu resources have been allocated.  radeon fb is
one such driver, which uses kmalloc.  After a CPU is down, the per-cpu data
of slab is freed, so the system crashed when printing some info.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:34 -07:00
Pavel Machek
c61978b303 [PATCH] swsusp: fix nr_copy_pages
The following patch moves the recalculation of nr_copy_pages so that the right
number is used in the calculation of the size of memory and swap needed.

It prevents swsusp from attempting to suspend if there is not enough memory
and/or swap (which is unlikely anyway).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:33 -07:00
Pavel Machek
2e4d5822dc [PATCH] swsusp: cleanup whitespace
The following patch cleans up whitespace in swsusp.c (a bit):

- removes any trailing whitespace

- adds spaces after if, for, for_each_pbe, for_each_zone etc., wherever
  necessary.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:33 -07:00
Pavel Machek
8f9bdf15c0 [PATCH] swsusp: kill unneccessary does_collide_order
The following patch removes the unnecessary function does_collide_order().

This function is no longer necessary, as currently there are only 0-order
allocations in swsusp, and the use of it is confusing.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:33 -07:00
Pavel Machek
620b032764 [PATCH] properly stop devices before poweroff
Without this patch, Linux provokes emergency disk shutdowns and
similar nastiness. It was in SuSE kernels for some time, IIRC.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:33 -07:00
Li Shaohua
5a72e04df5 [PATCH] suspend/resume SMP support
Using CPU hotplug to support suspend/resume SMP.  Both S3 and S4 use
disable/enable_nonboot_cpus API.  The S4 part is based on Pavel's original S4
SMP patch.

Signed-off-by: Li Shaohua<shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:32 -07:00
Zwane Mwaikambo
f370513640 [PATCH] i386 CPU hotplug
(The i386 CPU hotplug patch provides infrastructure for some work which Pavel
is doing as well as for ACPI S3 (suspend-to-RAM) work which Li Shaohua
<shaohua.li@intel.com> is doing)

The following provides i386 architecture support for safely unregistering and
registering processors during runtime, updated for the current -mm tree.  In
order to avoid dumping cpu hotplug code into kernel/irq/* i dropped the
cpu_online check in do_IRQ() by modifying fixup_irqs().  The difference being
that on cpu offline, fixup_irqs() is called before we clear the cpu from
cpu_online_map and a long delay in order to ensure that we never have any
queued external interrupts on the APICs.  There are additional changes to s390
and ppc64 to account for this change.

1) Add CONFIG_HOTPLUG_CPU
2) disable local APIC timer on dead cpus.
3) Disable preempt around irq balancing to prevent CPUs going down.
4) Print irq stats for all possible cpus.
5) Debugging check for interrupts on offline cpus.
6) Hacky fixup_irqs() to redirect irqs when cpus go off/online.
7) play_dead() for offline cpus to spin inside.
8) Handle offline cpus set in flush_tlb_others().
9) Grab lock earlier in smp_call_function() to prevent CPUs going down.
10) Implement __cpu_disable() and __cpu_die().
11) Enable local interrupts in cpu_enable() after fixup_irqs()
12) Don't fiddle with NMI on dead cpu, but leave intact on other cpus.
13) Program IRQ affinity whilst cpu is still in cpu_online_map on offline.

Signed-off-by: Zwane Mwaikambo <zwane@linuxpower.ca>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25 16:24:29 -07:00
David Woodhouse
e1b09eba26 AUDIT: Use KERN_NOTICE for printk of audit records
They aren't errors.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-06-24 17:24:11 +01:00
David Woodhouse
5bb289b5a0 AUDIT: Clean up user message filtering
Don't look up the task by its pid and then use the syscall filtering
helper. Just implement our own filter helper which operates solely on
the information in the netlink_skb_parms. 

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-06-24 14:14:05 +01:00
David Woodhouse
993e2d4106 AUDIT: Return correct result from audit_filter_rules()
When the task refcounting was added to audit_filter_rules() it became
more of a problem that this function was violating the 'only one 
return from each function' rule. In fixing it to use a variable to store 
'ret' I stupidly neglected to actually change the 'return 1;' at the 
end. This makes it not work very well.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-06-24 08:21:49 +01:00
Adrian Bunk
52c1da3953 [PATCH] make various thing static
Another rollup of patches which give various symbols static scope

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24 00:06:43 -07:00
Matt Domsch
c988d2b284 [PATCH] modules: add version and srcversion to sysfs
This patch adds version and srcversion files to
/sys/module/${modulename} containing the version and srcversion fields
of the module's modinfo section (if present).

/sys/module/e1000
|-- srcversion
`-- version

This patch differs slightly from the version posted in January, as it
now uses the new kstrdup() call in -mm.

Why put this in sysfs?

a) Tools like DKMS, which deal with changing out individual kernel
   modules without replacing the whole kernel, can behave smarter if they
   can tell the version of a given module.  The autoinstaller feature, for
   example, which determines if your system has a "good" version of a
   driver (i.e.  if the one provided by DKMS has a newer verson than that
   provided by the kernel package installed), and to automatically compile
   and install a newer version if DKMS has it but your kernel doesn't yet
   have that version.

b) Because sysadmins manually, or with tools like DKMS, can switch out
   modules on the file system, you can't count on 'modinfo foo.ko', which
   looks at /lib/modules/${kernelver}/...  actually matching what is loaded
   into the kernel already.  Hence asking sysfs for this.

c) as the unbind-driver-from-device work takes shape, it will be
   possible to rebind a driver that's built-in (no .ko to modinfo for the
   version) to a newly loaded module.  sysfs will have the
   currently-built-in version info, for comparison.

d) tech support scripts can then easily grab the version info for what's
   running presently - a question I get often.

There has been renewed interest in this patch on linux-scsi by driver
authors.

As the idea originated from GregKH, I leave his Signed-off-by: intact,
though the implementation is nearly completely new.  Compiled and run on
x86 and x86_64.

From: Matthew Dobson <colpatch@us.ibm.com>

      build fix

From: Thierry Vignaud <tvignaud@mandriva.com>

      build fix

From: Matthew Dobson <colpatch@us.ibm.com>

      warning fix

Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
Signed-off-by: Matt Domsch <Matt_Domsch@dell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24 00:06:40 -07:00
David Howells
3e30148c3d [PATCH] Keys: Make request-key create an authorisation key
The attached patch makes the following changes:

 (1) There's a new special key type called ".request_key_auth".

     This is an authorisation key for when one process requests a key and
     another process is started to construct it. This type of key cannot be
     created by the user; nor can it be requested by kernel services.

     Authorisation keys hold two references:

     (a) Each refers to a key being constructed. When the key being
     	 constructed is instantiated the authorisation key is revoked,
     	 rendering it of no further use.

     (b) The "authorising process". This is either:

     	 (i) the process that called request_key(), or:

     	 (ii) if the process that called request_key() itself had an
     	      authorisation key in its session keyring, then the authorising
     	      process referred to by that authorisation key will also be
     	      referred to by the new authorisation key.

	 This means that the process that initiated a chain of key requests
	 will authorise the lot of them, and will, by default, wind up with
	 the keys obtained from them in its keyrings.

 (2) request_key() creates an authorisation key which is then passed to
     /sbin/request-key in as part of a new session keyring.

 (3) When request_key() is searching for a key to hand back to the caller, if
     it comes across an authorisation key in the session keyring of the
     calling process, it will also search the keyrings of the process
     specified therein and it will use the specified process's credentials
     (fsuid, fsgid, groups) to do that rather than the calling process's
     credentials.

     This allows a process started by /sbin/request-key to find keys belonging
     to the authorising process.

 (4) A key can be read, even if the process executing KEYCTL_READ doesn't have
     direct read or search permission if that key is contained within the
     keyrings of a process specified by an authorisation key found within the
     calling process's session keyring, and is searchable using the
     credentials of the authorising process.

     This allows a process started by /sbin/request-key to read keys belonging
     to the authorising process.

 (5) The magic KEY_SPEC_*_KEYRING key IDs when passed to KEYCTL_INSTANTIATE or
     KEYCTL_NEGATE will specify a keyring of the authorising process, rather
     than the process doing the instantiation.

 (6) One of the process keyrings can be nominated as the default to which
     request_key() should attach new keys if not otherwise specified. This is
     done with KEYCTL_SET_REQKEY_KEYRING and one of the KEY_REQKEY_DEFL_*
     constants. The current setting can also be read using this call.

 (7) request_key() is partially interruptible. If it is waiting for another
     process to finish constructing a key, it can be interrupted. This permits
     a request-key cycle to be broken without recourse to rebooting.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-Off-By: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24 00:05:19 -07:00
David Howells
7888e7ff4e [PATCH] Keys: Pass session keyring to call_usermodehelper()
The attached patch makes it possible to pass a session keyring through to the
process spawned by call_usermodehelper().  This allows patch 3/3 to pass an
authorisation key through to /sbin/request-key, thus permitting better access
controls when doing just-in-time key creation.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24 00:05:18 -07:00
David Woodhouse
9e94e66a5b AUDIT: No really, we don't want to audit auditd.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-06-23 18:33:54 +01:00
Benjamin LaHaise
c43dc2fd88 [PATCH] aio: make wait_queue ->task ->private
In the upcoming aio_down patch, it is useful to store a private data
pointer in the kiocb's wait_queue.  Since we provide our own wake up
function and do not require the task_struct pointer, it makes sense to
convert the task pointer into a generic private pointer.

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:34 -07:00
Christoph Lameter
71a2224d7d [PATCH] Optimize sys_times for a single thread process
Avoid taking the tasklist_lock in sys_times if the process is single
threaded.  In a NUMA system taking the tasklist_lock may cause a bouncing
cacheline if multiple independent processes continually call sys_times to
measure their performance.

Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:30 -07:00
Kirill Korotaev
4fea2838aa [PATCH] Software suspend and recalc sigpending bug fix
This patch fixes recalc_sigpending() to work correctly with tasks which are
being freezed.

The problem is that freeze_processes() sets PF_FREEZE and TIF_SIGPENDING
flags on tasks, but recalc_sigpending() called from e.g.
sys_rt_sigtimedwait or any other kernel place will clear TIF_SIGPENDING due
to no pending signals queued and the tasks won't be freezed until it
recieves a real signal or freezed_processes() fail due to timeout.

Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-Off-By: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:27 -07:00
Alan Cox
d6e7114481 [PATCH] setuid core dump
Add a new `suid_dumpable' sysctl:

This value can be used to query and set the core dump mode for setuid
or otherwise protected/tainted binaries. The modes are

0 - (default) - traditional behaviour.  Any process which has changed
    privilege levels or is execute only will not be dumped

1 - (debug) - all processes dump core when possible.  The core dump is
    owned by the current user and no security is applied.  This is intended
    for system debugging situations only.  Ptrace is unchecked.

2 - (suidsafe) - any binary which normally would not be dumped is dumped
    readable by root only.  This allows the end user to remove such a dump but
    not access it directly.  For security reasons core dumps in this mode will
    not overwrite one another or other files.  This mode is appropriate when
    adminstrators are attempting to debug problems in a normal environment.

(akpm:

> > +EXPORT_SYMBOL(suid_dumpable);
>
> EXPORT_SYMBOL_GPL?

No problem to me.

> >  	if (current->euid == current->uid && current->egid == current->gid)
> >  		current->mm->dumpable = 1;
>
> Should this be SUID_DUMP_USER?

Actually the feedback I had from last time was that the SUID_ defines
should go because its clearer to follow the numbers. They can go
everywhere (and there are lots of places where dumpable is tested/used
as a bool in untouched code)

> Maybe this should be renamed to `dump_policy' or something.  Doing that
> would help us catch any code which isn't using the #defines, too.

Fair comment. The patch was designed to be easy to maintain for Red Hat
rather than for merging. Changing that field would create a gigantic
diff because it is used all over the place.

)

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:26 -07:00
Prasanna S Panchamukhi
8b0914ea74 [PATCH] jprobes: allow a jprobe to coexist with muliple kprobes
Presently either multiple kprobes or only one jprobe could be inserted.
This patch removes the above limitation and allows one jprobe and multiple
kprobes to coexist at the same address.  However multiple jprobes cannot
coexist with multiple kprobes.  Currently I am working on the prototype to
allow multiple jprobes coexist with multiple kprobes.

Signed-off-by: Ananth N Mavinakayanhalli <amavin@redhat.com>
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:25 -07:00
Prasanna S Panchamukhi
ea32c65cc2 [PATCH] kprobes: Temporary disarming of reentrant probe
In situations where a kprobes handler calls a routine which has a probe on it,
then kprobes_handler() disarms the new probe forever.  This patch removes the
above limitation by temporarily disarming the new probe.  When the another
probe hits while handling the old probe, the kprobes_handler() saves previous
kprobes state and handles the new probe without calling the new kprobes
registered handlers.  kprobe_post_handler() restores back the previous kprobes
state and the normal execution continues.

However on x86_64 architecture, re-rentrancy is provided only through
pre_handler().  If a routine having probe is referenced through
post_handler(), then the probes on that routine are disarmed forever, since
the exception stack is gets changed after the processor single steps the
instruction of the new probe.

This patch includes generic changes to support temporary disarming on
reentrancy of probes.

Signed-of-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:24 -07:00
Hien Nguyen
0aa55e4d7d [PATCH] kprobes: moves lock-unlock to non-arch kprobe_flush_task
This patch moves the lock/unlock of the arch specific kprobe_flush_task()
to the non-arch specific kprobe_flusk_task().

Signed-off-by: Hien Nguyen <hien@us.ibm.com>
Acked-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:21 -07:00
Rusty Lynch
7e1048b11c [PATCH] Move kprobe [dis]arming into arch specific code
The architecture independent code of the current kprobes implementation is
arming and disarming kprobes at registration time.  The problem is that the
code is assuming that arming and disarming is a just done by a simple write
of some magic value to an address.  This is problematic for ia64 where our
instructions look more like structures, and we can not insert break points
by just doing something like:

*p->addr = BREAKPOINT_INSTRUCTION;

The following patch to 2.6.12-rc4-mm2 adds two new architecture dependent
functions:

     * void arch_arm_kprobe(struct kprobe *p)
     * void arch_disarm_kprobe(struct kprobe *p)

and then adds the new functions for each of the architectures that already
implement kprobes (spar64/ppc64/i386/x86_64).

I thought arch_[dis]arm_kprobe was the most descriptive of what was really
happening, but each of the architectures already had a disarm_kprobe()
function that was really a "disarm and do some other clean-up items as
needed when you stumble across a recursive kprobe." So...  I took the
liberty of changing the code that was calling disarm_kprobe() to call
arch_disarm_kprobe(), and then do the cleanup in the block of code dealing
with the recursive kprobe case.

So far this patch as been tested on i386, x86_64, and ppc64, but still
needs to be tested in sparc64.

Signed-off-by: Rusty Lynch <rusty.lynch@intel.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:21 -07:00
Hien Nguyen
b94cce926b [PATCH] kprobes: function-return probes
This patch adds function-return probes to kprobes for the i386
architecture.  This enables you to establish a handler to be run when a
function returns.

1. API

Two new functions are added to kprobes:

	int register_kretprobe(struct kretprobe *rp);
	void unregister_kretprobe(struct kretprobe *rp);

2. Registration and unregistration

2.1 Register

  To register a function-return probe, the user populates the following
  fields in a kretprobe object and calls register_kretprobe() with the
  kretprobe address as an argument:

  kp.addr - the function's address

  handler - this function is run after the ret instruction executes, but
  before control returns to the return address in the caller.

  maxactive - The maximum number of instances of the probed function that
  can be active concurrently.  For example, if the function is non-
  recursive and is called with a spinlock or mutex held, maxactive = 1
  should be enough.  If the function is non-recursive and can never
  relinquish the CPU (e.g., via a semaphore or preemption), NR_CPUS should
  be enough.  maxactive is used to determine how many kretprobe_instance
  objects to allocate for this particular probed function.  If maxactive <=
  0, it is set to a default value (if CONFIG_PREEMPT maxactive=max(10, 2 *
  NR_CPUS) else maxactive=NR_CPUS)

  For example:

    struct kretprobe rp;
    rp.kp.addr = /* entrypoint address */
    rp.handler = /*return probe handler */
    rp.maxactive = /* e.g., 1 or NR_CPUS or 0, see the above explanation */
    register_kretprobe(&rp);

  The following field may also be of interest:

  nmissed - Initialized to zero when the function-return probe is
  registered, and incremented every time the probed function is entered but
  there is no kretprobe_instance object available for establishing the
  function-return probe (i.e., because maxactive was set too low).

2.2 Unregister

  To unregiter a function-return probe, the user calls
  unregister_kretprobe() with the same kretprobe object as registered
  previously.  If a probed function is running when the return probe is
  unregistered, the function will return as expected, but the handler won't
  be run.

3. Limitations

3.1 This patch supports only the i386 architecture, but patches for
    x86_64 and ppc64 are anticipated soon.

3.2 Return probes operates by replacing the return address in the stack
    (or in a known register, such as the lr register for ppc).  This may
    cause __builtin_return_address(0), when invoked from the return-probed
    function, to return the address of the return-probes trampoline.

3.3 This implementation uses the "Multiprobes at an address" feature in
    2.6.12-rc3-mm3.

3.4 Due to a limitation in multi-probes, you cannot currently establish
    a return probe and a jprobe on the same function.  A patch to remove
    this limitation is being tested.

This feature is required by SystemTap (http://sourceware.org/systemtap),
and reflects ideas contributed by several SystemTap developers, including
Will Cohen and Ananth Mavinakayanahalli.

Signed-off-by: Hien Nguyen <hien@us.ibm.com>
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@laposte.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:21 -07:00
Alexander Nyberg
df164db5fd [PATCH] avoid resursive oopses
Prevent recursive faults in do_exit() by leaving the task alone and wait
for reboot.  This may allow a more graceful shutdown and possibly save the
original oops.

Signed-off-by: Alexander Nyberg <alexn@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:20 -07:00
Christoph Hellwig
5f45f1a78f [PATCH] remove duplicate get_dentry functions in various places
Various filesystem drivers have grown a get_dentry() function that's a
duplicate of lookup_one_len, except that it doesn't take a maximum length
argument and doesn't check for \0 or / in the passed in filename.

Switch all these places to use lookup_one_len.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Greg KH <greg@kroah.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:20 -07:00
Jesper Juhl
be5b4fbd01 [PATCH] preempt_count is int - remove cast and don't assign to unsigned type
In kernel/sched.c the return value from preempt_count() is cast to an int.
That made sense when preempt_count was defined as different types on is not
needed and should go away.  The patch removes the cast.

In kernel/timer.c the return value from preempt_count() is assigned to a
variable of type u32 and then that unsigned value is later compared to
preempt_count().  Since preempt_count() returns an int, an int is what
should be used to store its return value.  Storing the result in an
unsigned 32bit integer made a tiny bit of sense back when preempt_count was
different types on different archs, but no more - let's not play signed vs
unsigned comparison games when we don't have to.  The patch modifies the
code to use an int to hold the value.  While I was around that bit of code
I also made two changes to a nearby (related) printk() - I modified it to
specify the loglevel explicitly and also broke the line into a few pieces
to avoid it being longer than 80 chars and clarified the text a bit.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:19 -07:00
Greg Edwards
ab4af03a40 [PATCH] CON_CONSDEV bit not set correctly on last console
According to include/linux/console.h, CON_CONSDEV flag should be set on
the last console specified on the boot command line:

     86 #define CON_PRINTBUFFER (1)
     87 #define CON_CONSDEV     (2) /* Last on the command line */
     88 #define CON_ENABLED     (4)
     89 #define CON_BOOT        (8)

This does not currently happen if there is more than one console specified
on the boot commandline.  Instead, it gets set on the first console on the
command line.  This can cause problems for things like kdb that look for
the CON_CONSDEV flag to see if the console is valid.

Additionaly, it doesn't look like CON_CONSDEV is reassigned to the next
preferred console at unregister time if the console being unregistered
currently has that bit set.

Example (from sn2 ia64):

elilo vmlinuz root=<dev> console=ttyS0 console=ttySG0

in this case, the flags on ttySG console struct will be 0x4 (should be
0x6).

Attached patch against bk fixes both issues for the cases I looked at.  It
uses selected_console (which gets incremented for each console specified on
the command line) as the indicator of which console to set CON_CONSDEV on.
When adding the console to the list, if the previous one had CON_CONSDEV
set, it masks it out.  Tested on ia64 and x86.

The problem with the current behavior is it breaks overriding the default from
the boot line.  In the ia64 case, there may be a global append line defining
console=a in elilo.conf.  Then you want to boot your kernel, and want to
override the default by passing console=b on the boot line.  elilo constructs
the kernel cmdline by starting with the value of the global append line, then
tacks on whatever else you specify, which puts console=b last.

Signed-off-by: Greg Edwards <edwardsg@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:18 -07:00
Oleg Nesterov
f972be33ce [PATCH] posix-timers: use try_to_del_timer_sync()
sys_timer_settime/sys_timer_delete needs to delete k_itimer->real.timer
synchronously while holding ->it_lock, which is also locked in
posix_timer_fn.

This patch removes timer_active/set_timer_inactive which plays with
timer_list's internals in favour of using try_to_del_timer_sync(), which
was introduced in the previous patch.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:17 -07:00
Oleg Nesterov
fd450b7318 [PATCH] timers: introduce try_to_del_timer_sync()
This patch splits del_timer_sync() into 2 functions.  The new one,
try_to_del_timer_sync(), returns -1 when it hits executing timer.

It can be used in interrupt context, or when the caller hold locks which
can prevent completion of the timer's handler.

NOTE.  Currently it can't be used in interrupt context in UP case, because
->running_timer is used only with CONFIG_SMP.

Should the need arise, it is possible to kill #ifdef CONFIG_SMP in
set_running_timer(), it is cheap.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:16 -07:00
Oleg Nesterov
55c888d6d0 [PATCH] timers fixes/improvements
This patch tries to solve following problems:

1. del_timer_sync() is racy. The timer can be fired again after
   del_timer_sync have checked all cpus and before it will recheck
   timer_pending().

2. It has scalability problems. All cpus are scanned to determine
   if the timer is running on that cpu.

   With this patch del_timer_sync is O(1) and no slower than plain
   del_timer(pending_timer), unless it has to actually wait for
   completion of the currently running timer.

   The only restriction is that the recurring timer should not use
   add_timer_on().

3. The timers are not serialized wrt to itself.

   If CPU_0 does mod_timer(jiffies+1) while the timer is currently
   running on CPU 1, it is quite possible that local interrupt on
   CPU_0 will start that timer before it finished on CPU_1.

4. The timers locking is suboptimal. __mod_timer() takes 3 locks
   at once and still requires wmb() in del_timer/run_timers.

   The new implementation takes 2 locks sequentially and does not
   need memory barriers.

Currently ->base != NULL means that the timer is pending. In that case
->base.lock is used to lock the timer. __mod_timer also takes timer->lock
because ->base can be == NULL.

This patch uses timer->entry.next != NULL as indication that the timer is
pending. So it does __list_del(), entry->next = NULL instead of list_del()
when the timer is deleted.

The ->base field is used for hashed locking only, it is initialized
in init_timer() which sets ->base = per_cpu(tvec_bases). When the
tvec_bases.lock is locked, it means that all timers which are tied
to this base via timer->base are locked, and the base itself is locked
too.

So __run_timers/migrate_timers can safely modify all timers which could
be found on ->tvX lists (pending timers).

When the timer's base is locked, and the timer removed from ->entry list
(which means that _run_timers/migrate_timers can't see this timer), it is
possible to set timer->base = NULL and drop the lock: the timer remains
locked.

This patch adds lock_timer_base() helper, which waits for ->base != NULL,
locks the ->base, and checks it is still the same.

__mod_timer() schedules the timer on the local CPU and changes it's base.
However, it does not lock both old and new bases at once. It locks the
timer via lock_timer_base(), deletes the timer, sets ->base = NULL, and
unlocks old base. Then __mod_timer() locks new_base, sets ->base = new_base,
and adds this timer. This simplifies the code, because AB-BA deadlock is not
possible. __mod_timer() also ensures that the timer's base is not changed
while the timer's handler is running on the old base.

__run_timers(), del_timer() do not change ->base anymore, they only clear
pending flag.

So del_timer_sync() can test timer->base->running_timer == timer to detect
whether it is running or not.

We don't need timer_list->lock anymore, this patch kills it.

We also don't need barriers. del_timer() and __run_timers() used smp_wmb()
before clearing timer's pending flag. It was needed because __mod_timer()
did not lock old_base if the timer is not pending, so __mod_timer()->list_add()
could race with del_timer()->list_del(). With this patch these functions are
serialized through base->lock.

One problem. TIMER_INITIALIZER can't use per_cpu(tvec_bases). So this patch
adds global

        struct timer_base_s {
                spinlock_t lock;
                struct timer_list *running_timer;
        } __init_timer_base;

which is used by TIMER_INITIALIZER. The corresponding fields in tvec_t_base_s
struct are replaced by struct timer_base_s t_base.

It is indeed ugly. But this can't have scalability problems. The global
__init_timer_base.lock is used only when __mod_timer() is called for the first
time AND the timer was compile time initialized. After that the timer migrates
to the local CPU.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:16 -07:00
Christoph Lameter
5912100372 [PATCH] i386: Selectable Frequency of the Timer Interrupt
Make the timer frequency selectable. The timer interrupt may cause bus
and memory contention in large NUMA systems since the interrupt occurs
on each processor HZ times per second.

Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 09:45:10 -07:00
David Woodhouse
9470178e62 AUDIT: Remove stray declaration of tsk from audit_receive_msg().
It's not used any more.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-06-22 15:40:55 +01:00
David Woodhouse
9ad9ad385b AUDIT: Wait for backlog to clear when generating messages.
Add a gfp_mask to audit_log_start() and audit_log(), to reduce the
amount of GFP_ATOMIC allocation -- most of it doesn't need to be 
GFP_ATOMIC. Also if the mask includes __GFP_WAIT, then wait up to
60 seconds for the auditd backlog to clear instead of immediately 
abandoning the message. 

The timeout should probably be made configurable, but for now it'll 
suffice that it only happens if auditd is actually running.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-06-22 15:04:33 +01:00
David Woodhouse
4a4cd633b5 AUDIT: Optimise the audit-disabled case for discarding user messages
Also exempt USER_AVC message from being discarded to preserve 
existing behaviour for SE Linux.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-06-22 14:56:47 +01:00
Paolo 'Blaisorblade' Giarrusso
b77d6adc92 [PATCH] uml: make hw_controller_type->release exist only for archs needing it
With Chris Wedgwood <cw@f00f.org>

As suggested by Chris, we can make the "just added" method ->release
conditional to UML only (better: to archs requesting it, i.e.  only UML
currently), so that other archs don't get this unneeded crud, and if UML
won't need it any more we can kill this.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
CC: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 19:07:32 -07:00
Paolo 'Blaisorblade' Giarrusso
dbce706e25 [PATCH] uml: add and use generic hw_controller_type->release
With Chris Wedgwood <cw@f00f.org>

Currently UML must explicitly call the UML-specific
free_irq_by_irq_and_dev() for each free_irq call it's done.

This is needed because ->shutdown and/or ->disable are only called when the
last "action" for that irq is removed.

Instead, for UML shared IRQs (UML IRQs are very often, if not always,
shared), for each dev_id some setup is done, which must be cleared on the
release of that fd.  For instance, for each open console a new instance
(i.e.  new dev_id) of the same IRQ is requested().

Exactly, a fd is stored in an array (pollfds), which is after read by a
host thread and passed to poll().  Each event registered by poll() triggers
an interrupt.  So, for each free_irq() we must remove the corresponding
host fd from the table, which we do via this -release() method.

In this patch we add an appropriate hook for this, and remove all uses of
it by pointing the hook to the said procedure; this is safe to do since the
said procedure.

Also some cosmetic improvements are included.

This is heavily based on some work by Chris Wedgwood, which however didn't
get the patch merged for something I'd call a "misunderstanding" (the need
for this patch wasn't cleanly explained, thus adding the generic hook was
felt as undesirable).

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
CC: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 19:07:32 -07:00
Hugh Dickins
45918e1a8b [PATCH] dup_mmap: update comment on new vma
Remove part of comment on linking new vma in dup_mmap: since anon_vma rmap
came in, try_to_unmap_one knows the vma without needing find_vma.  But add
a comment to note that here vma is inserted without mmap_sem.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:19 -07:00
Wolfgang Wander
1363c3cd86 [PATCH] Avoiding mmap fragmentation
Ingo recently introduced a great speedup for allocating new mmaps using the
free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
causes huge performance increases in thread creation.

The downside of this patch is that it does lead to fragmentation in the
mmap-ed areas (visible via /proc/self/maps), such that some applications
that work fine under 2.4 kernels quickly run out of memory on any 2.6
kernel.

The problem is twofold:

  1) the free_area_cache is used to continue a search for memory where
     the last search ended.  Before the change new areas were always
     searched from the base address on.

     So now new small areas are cluttering holes of all sizes
     throughout the whole mmap-able region whereas before small holes
     tended to close holes near the base leaving holes far from the base
     large and available for larger requests.

  2) the free_area_cache also is set to the location of the last
     munmap-ed area so in scenarios where we allocate e.g.  five regions of
     1K each, then free regions 4 2 3 in this order the next request for 1K
     will be placed in the position of the old region 3, whereas before we
     appended it to the still active region 1, placing it at the location
     of the old region 2.  Before we had 1 free region of 2K, now we only
     get two free regions of 1K -> fragmentation.

The patch addresses thes issues by introducing yet another cache descriptor
cached_hole_size that contains the largest known hole size below the
current free_area_cache.  If a new request comes in the size is compared
against the cached_hole_size and if the request can be filled with a hole
below free_area_cache the search is started from the base instead.

The results look promising: Whereas 2.6.12-rc4 fragments quickly and my
(earlier posted) leakme.c test program terminates after 50000+ iterations
with 96 distinct and fragmented maps in /proc/self/maps it performs nicely
(as expected) with thread creation, Ingo's test_str02 with 20000 threads
requires 0.7s system time.

Taking out Ingo's patch (un-patch available per request) by basically
deleting all mentions of free_area_cache from the kernel and starting the
search for new memory always at the respective bases we observe: leakme
terminates successfully with 11 distinctive hardly fragmented areas in
/proc/self/maps but thread creating is gringdingly slow: 30+s(!) system
time for Ingo's test_str02 with 20000 threads.

Now - drumroll ;-) the appended patch works fine with leakme: it ends with
only 7 distinct areas in /proc/self/maps and also thread creation seems
sufficiently fast with 0.71s for 20000 threads.

Signed-off-by: Wolfgang Wander <wwc@rentec.com>
Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:16 -07:00
Martin Hicks
753ee72896 [PATCH] VM: early zone reclaim
This is the core of the (much simplified) early reclaim.  The goal of this
patch is to reclaim some easily-freed pages from a zone before falling back
onto another zone.

One of the major uses of this is NUMA machines.  With the default allocator
behavior the allocator would look for memory in another zone, which might be
off-node, before trying to reclaim from the current zone.

This adds a zone tuneable to enable early zone reclaim.  It is selected on a
per-zone basis and is turned on/off via syscall.

Adding some extra throttling on the reclaim was also required (patch
4/4).  Without the machine would grind to a crawl when doing a "make -j"
kernel build.  Even with this patch the System Time is higher on
average, but it seems tolerable.  Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

			wall  user   sys   %cpu  ctx sw.  sleeps
			----  ----   ---   ----   ------  ------
No patch		1009  1384   847   258   298170   504402
w/patch, no reclaim     880   1376   667   288   254064   396745
w/patch & reclaim       1079  1385   926   252   291625   548873

These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot.  Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.

I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.

Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).

The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:14 -07:00
Ingo Molnar
39c715b717 [PATCH] smp_processor_id() cleanup
This patch implements a number of smp_processor_id() cleanup ideas that
Arjan van de Ven and I came up with.

The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
spaghetti was hard to follow both on the implementational and on the
usage side.

Some of the complexity arose from picking wrong names, some of the
complexity comes from the fact that not all architectures defined
__smp_processor_id.

In the new code, there are two externally visible symbols:

 - smp_processor_id(): debug variant.

 - raw_smp_processor_id(): nondebug variant. Replaces all existing
   uses of _smp_processor_id() and __smp_processor_id(). Defined
   by every SMP architecture in include/asm-*/smp.h.

There is one new internal symbol, dependent on DEBUG_PREEMPT:

 - debug_smp_processor_id(): internal debug variant, mapped to
                             smp_processor_id().

Also, i moved debug_smp_processor_id() from lib/kernel_lock.c into a new
lib/smp_processor_id.c file.  All related comments got updated and/or
clarified.

I have build/boot tested the following 8 .config combinations on x86:

 {SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}

I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT.  (Other
architectures are untested, but should work just fine.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:13 -07:00
David Woodhouse
f6a789d198 AUDIT: Spawn kernel thread to list filter rules.
If we have enough rules to fill the netlink buffer space, it'll 
deadlock because auditctl isn't ever actually going to read from the 
socket until we return, and we aren't going to return until it 
reads... so we spawn a kernel thread to spew out the list and then
exit.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-06-21 16:22:01 +01:00
Dmitry Torokhov
70f2817a43 [PATCH] sysfs: (rest) if show/store is missing return -EIO
sysfs: fix the rest of the kernel so if an attribute doesn't
       implement show or store method read/write will return
       -EIO instead of 0 or -EINVAL or -EPERM.

Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2005-06-20 15:15:03 -07:00
David Woodhouse
ae7b961b1c AUDIT: Report lookup flags with path/inode records.
When LOOKUP_PARENT is used, the inode which results is not the inode
found at the pathname. Report the flags so that this doesn't generate
misleading audit records.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-06-20 16:11:05 +01:00
David Woodhouse
f7056d64ae AUDIT: Really exempt auditd from having its actions audited.
We were only avoiding it on syscall exit before; now stop _everything_.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-06-20 16:07:33 +01:00
David Woodhouse
d6e0e1585a AUDIT: Drop user-generated messages immediately while auditing disabled.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-06-20 16:02:09 +01:00
David Woodhouse
0f45aa18e6 AUDIT: Allow filtering of user messages
Turn the field from a bitmask to an enumeration and add a list to allow 
filtering of messages generated by userspace. We also define a list for 
file system watches in anticipation of that feature.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-06-19 19:35:50 +01:00
David Woodhouse
0107b3cf32 Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git 2005-06-18 08:36:46 +01:00
Ingo Molnar
caf2857ac6 [PATCH] timer exit cleanup
Do all timer zapping in exit_itimers.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-17 10:03:50 -07:00
Jan Kara
6df3cecbb9 [PATCH] cond_resched_lock() fix
On one path, cond_resched_lock() fails to return true if it dropped the lock.
We think this might be causing the crashes in JBD's log_do_checkpoint().

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-13 20:58:58 -07:00
David Woodhouse
1c3f45ab2f Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git 2005-06-02 16:39:11 +01:00
Roman Zippel
ae92ef8a44 [PATCH] flush icache in correct context
flush_icache_range() is used in two different situation - in binfmt_elf.c &
co for user space mappings and module.c for kernel modules.  On m68k
flush_icache_range() doesn't know which data to flush, as it has separate
address spaces and the pointer argument can be valid in either address
space.

First I considered splitting flush_icache_range(), but this patch is
simpler.  Setting the correct context gives flush_icache_range() enough
information to flush the correct data.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-31 14:54:18 -07:00
John Hawkes
b60c1f6ffd [PATCH] drop note_interrupt() for per-CPU for proper scaling
The "unhandled interrupts" catcher, note_interrupt(), increments a global
desc->irq_count and grossly damages scaling of very large systems, e.g.,
>192p ia64 Altix, because of this highly contented cacheline, especially
for timer interrupts.  384p is severely crippled, and 512p is unuseable.

All calls to note_interrupt() can be disabled by booting with "noirqdebug",
but this disables the useful interrupt checking for all interrupts.

I propose eliminating note_interrupt() for all per-CPU interrupts.  This
was the behavior of linux-2.6.10 and earlier, but in 2.6.11 a code
restructuring added a call to note_interrupt() for per-CPU interrupts.
Besides, note_interrupt() is a bit racy for concurrent CPU calls anyway, as
the desc->irq_count++ increment isn't atomic (which, if done, would make
scaling even worse).

Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-28 11:14:00 -07:00
Paul Jackson
2efe86b809 [PATCH] cpuset exit NULL dereference fix
There is a race in the kernel cpuset code, between the code
to handle notify_on_release, and the code to remove a cpuset.
The notify_on_release code can end up trying to access a
cpuset that has been removed.  In the most common case, this
causes a NULL pointer dereference from the routine cpuset_path.
However all manner of bad things are possible, in theory at least.

The existing code decrements the cpuset use count, and if the
count goes to zero, processes the notify_on_release request,
if appropriate.  However, once the count goes to zero, unless we
are holding the global cpuset_sem semaphore, there is nothing to
stop another task from immediately removing the cpuset entirely,
and recycling its memory.

The obvious fix would be to always hold the cpuset_sem
semaphore while decrementing the use count and dealing with
notify_on_release.  However we don't want to force a global
semaphore into the mainline task exit path, as that might create
a scaling problem.

The actual fix is almost as easy - since this is only an issue
for cpusets using notify_on_release, which the top level big
cpusets don't normally need to use, only take the cpuset_sem
for cpusets using notify_on_release.

This code has been run for hours without a hiccup, while running
a cpuset create/destroy stress test that could crash the existing
kernel in seconds.  This patch applies to the current -linus
git kernel.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Simon Derr <simon.derr@bull.net>
Acked-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-27 08:07:26 -07:00
David Woodhouse
8f37d47c9b AUDIT: Record working directory when syscall arguments are pathnames
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-27 12:17:28 +01:00
David Woodhouse
7551ced334 AUDIT: Defer freeing aux items until audit_free_context()
While they were all just simple blobs it made sense to just free them
as we walked through and logged them. Now that there are pointers to
other objects which need refcounting, we might as well revert to
_only_ logging them in audit_log_exit(), and put the code to free them
properly in only one place -- in audit_free_aux().

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
----------------------------------------------------------
2005-05-26 12:04:57 +01:00
Kirill Korotaev
c33880aadd [PATCH] sigkill priority fix
If SIGKILL does not have priority, we cannot instantly kill task before it
makes some unexpected job.  It can be critical, but we were unable to
reproduce this easily until Heiko Carstens <Heiko.Carstens@de.ibm.com>
reported this problem on LKML.

Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-Off-By: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-24 20:08:13 -07:00
David Woodhouse
99e45eeac8 AUDIT: Escape comm when logging task info
It comes from the user; it needs to be escaped.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-23 21:57:41 +01:00
David Woodhouse
bccf6ae083 AUDIT: Unify auid reporting, put arch before syscall number
These changes make processing of audit logs easier. Based on a patch
from Steve Grubb <sgrubb@redhat.com>

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-23 21:35:28 +01:00
David Woodhouse
bfb4496e72 AUDIT: Assign serial number to non-syscall messages
Move audit_serial() into audit.c and use it to generate serial numbers 
on messages even when there is no audit context from syscall auditing.  
This allows us to disambiguate audit records when more than one is 
generated in the same millisecond.

Based on a patch by Steve Grubb after he observed the problem.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-21 21:08:09 +01:00
Samuel Thibault
10f02d1c59 [PATCH] spin_unlock_bh() and preempt_check_resched()
In _spin_unlock_bh(lock):
	do { \
		_raw_spin_unlock(lock); \
		preempt_enable(); \
		local_bh_enable(); \
		__release(lock); \
	} while (0)

there is no reason for using preempt_enable() instead of a simple
preempt_enable_no_resched()

Since we know bottom halves are disabled, preempt_schedule() will always
return at once (preempt_count!=0), and hence preempt_check_resched() is
useless here...

This fixes it by using "preempt_enable_no_resched()" instead of the
"preempt_enable()", and thus avoids the useless preempt_check_resched()
just before re-enabling bottom halves.

Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-21 10:46:48 -07:00
Steve Grubb
326e9c8ba6 AUDIT: Fix inconsistent use of loginuid vs. auid, signed vs. unsigned
The attached patch changes all occurrences of loginuid to auid. It also 
changes everything to %u that is an unsigned type.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-21 00:22:31 +01:00
Steve Grubb
05474106a4 AUDIT: Fix AVC_USER message passing.
The original AVC_USER message wasn't consolidated with the new range of
user messages. The attached patch fixes the kernel so the old messages 
work again.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-21 00:18:37 +01:00
Stephen Smalley
011161051b AUDIT: Avoid sleeping function in SElinux AVC audit.
This patch changes the SELinux AVC to defer logging of paths to the audit
framework upon syscall exit, by saving a reference to the (dentry,vfsmount)
pair in an auxiliary audit item on the current audit context for processing
by audit_log_exit.

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-21 00:15:52 +01:00
Paul Jackson
b39c4fab25 [PATCH] cpusets+hotplug+preepmt broken
This patch removes the entwining of cpusets and hotplug code in the "No
more Mr.  Nice Guy" case of sched.c move_task_off_dead_cpu().

Since the hotplug code is holding a spinlock at this point, we cannot take
the cpuset semaphore, cpuset_sem, as would seem to be required either to
update the tasks cpuset, or to scan up the nested cpuset chain, looking for
the nearest cpuset ancestor that still has some CPUs that are online.  So
we just punt and blast the tasks cpus_allowed with all bits allowed.

This reverts these lines of code to what they were before the cpuset patch.
 And it updates the cpuset Doc file, to match.

The one known alternative to this that seems to work came from Dinakar
Guniguntala, and required the hotplug code to take the cpuset_sem semaphore
much earlier in its processing.  So far as we know, the increased locking
entanglement between cpusets and hot plug of this alternative approach is
not worth doing in this case.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-20 15:48:19 -07:00
David Woodhouse
fb19b4c6aa AUDIT: Honour audit_backlog_limit again.
The limit on the number of outstanding audit messages was inadvertently
removed with the switch to queuing skbs directly for sending by a kernel
thread. Put it back again.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-19 14:55:56 +01:00
David Woodhouse
7063e6c717 Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git 2005-05-19 11:54:00 +01:00
David Woodhouse
7ca0026495 AUDIT: Quis Custodiet Ipsos Custodes?
Nobody does. Really, it gets very silly if auditd is recording its
own actions.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-19 11:23:13 +01:00
David Woodhouse
b7d1125817 AUDIT: Send netlink messages from a separate kernel thread
netlink_unicast() will attempt to reallocate and will free messages if
the socket's rcvbuf limit is reached unless we give it an infinite 
timeout. So do that, from a kernel thread which is dedicated to spewing
stuff up the netlink socket.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-19 10:56:58 +01:00
Steve Grubb
168b717395 AUDIT: Clean up logging of untrusted strings
* If vsnprintf returns -1, it will mess up the sk buffer space accounting. 
This is fixed by not calling skb_put with bogus len values.

* audit_log_hex was a loop that called audit_log_vformat with %02X for each 
character. This is very inefficient since conversion from unsigned character 
to Ascii representation is essentially masking, shifting, and byte lookups. 
Also, the length of the converted string is well known - it's twice the 
original. Fixed by rewriting the function.

* audit_log_untrustedstring had no comments. This makes it hard for 
someone to understand what the string format will be.

* audit_log_d_path was never fixed to use untrustedstring. This could mess
up user space parsers. This was fixed to make a temp buffer, call d_path, 
and log temp buffer using untrustedstring. 

From: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-19 10:24:22 +01:00
David Woodhouse
209aba0324 AUDIT: Treat all user messages identically.
It's silly to have to add explicit entries for new userspace messages
as we invent them. Just treat all messages in the user range the same.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-18 10:21:07 +01:00
David Brownell
82428b62aa [PATCH] Driver Core: pm diagnostics update, check for errors
This patch includes various tweaks in the messaging that appears during
system pm state transitions:

  * Warn about certain illegal calls in the device tree, like resuming
    child before parent or suspending parent before child.  This could
    happen easily enough through sysfs, or in some cases when drivers
    use device_pm_set_parent().

  * Be more consistent about dev_dbg() tracing ... do it for resume() and
    shutdown() too, and never if the driver doesn't have that method.

  * Say which type of system sleep state is being entered.

Except for the warnings, these only affect debug messaging.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2005-05-17 14:54:54 -07:00
William Lee Irwin III
dfaa9c94b1 [PATCH] profile.c: `schedule' parsing fix
profile=schedule parsing is not quite what it should be.  First, str[7] is
'e', not ',', but then even if it did fall through, prof_on =
SCHED_PROFILING would be clobbered inside if (get_option(...)) So a small
amount of rearrangement is done in this patch to correct it.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-17 07:59:21 -07:00
Matt Mackall
3c0547ba8b [PATCH] add_preferred_console() build fix
Move add_preferred_console out of CONFIG_PRINTK so serial console does the
right thing.

Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-17 07:59:19 -07:00
Zhang, Yanmin
4f167fb491 [PATCH] spurious interrupt fix
On my IA64 machine, after kernel 2.6.12-rc3 boots, an edge-triggered
interrupt (IRQ 46) keeps triggered over and over again.  There is no IRQ 46
interrupt action handler.  It has lots of impact on performance.

Kernel 2.6.10 and its prior versions have no the problem.  Basically,
kernel 2.6.10 will mask the spurious edge interrupt if the interrupt is
triggered for the second time and its status includes
IRQ_DISABLE|IRQ_PENDING.

Originally, IA64 kernel has its own specific _irq_desc definitions in file
arch/ia64/kernel/irq.c.  The definition initiates _irq_desc[irq].status to
IRQ_DISABLE.  Since kernel 2.6.11, it was moved to architecture independent
codes, i.e.  kernel/irq/handle.c, but kernel/irq/handle.c initiates
_irq_desc[irq].status to 0 instead of IRQ_DISABLE.

Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-17 07:59:18 -07:00
David Woodhouse
3ec3b2fba5 AUDIT: Capture sys_socketcall arguments and sockaddrs
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-17 12:08:48 +01:00
David Woodhouse
5e014b10ef AUDIT: fix max_t thinko.
Der... if you use max_t it helps if you give it a type. 

Note to self: Always just apply the tested patches, don't try to port 
them by hand. You're not clever enough.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-13 18:50:33 +01:00
Steve Grubb
23f32d18aa AUDIT: Fix some spelling errors
I'm going through the kernel code and have a patch that corrects 
several spelling errors in comments.

From: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-13 18:35:15 +01:00
Steve Grubb
c04049939f AUDIT: Add message types to audit records
This patch adds more messages types to the audit subsystem so that audit 
analysis is quicker, intuitive, and more useful.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
---
I forgot one type in the big patch. I need to add one for user space 
originating SE Linux avc messages. This is used by dbus and nscd.

-Steve
---
Updated to 2.6.12-rc4-mm1.
-dwmw2

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-13 18:17:42 +01:00
David Woodhouse
9ea74f0655 AUDIT: Round up audit skb expansion to AUDIT_BUFSIZ.
Otherwise, we will be repeatedly reallocating, even if we're only
adding a few bytes at a time. Pointed out by Steve Grubb.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-13 16:35:19 +01:00
Chris Wright
c1b773d87e Add audit_log_type
Add audit_log_type to allow callers to specify type and pid when logging.
Convert audit_log to wrapper around audit_log_type.  Could have
converted all audit_log callers directly, but common case is default
of type AUDIT_KERNEL and pid 0.  Update audit_log_start to take type
and pid values when creating a new audit_buffer.  Move sequences that
did audit_log_start, audit_log_format, audit_set_type, audit_log_end,
to simply call audit_log_type directly.  This obsoletes audit_set_type
and audit_set_pid, so remove them.

Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-11 10:55:10 +01:00
Chris Wright
197c69c6af Move ifdef CONFIG_AUDITSYSCALL to header
Remove code conditionally dependent on CONFIG_AUDITSYSCALL from audit.c.
Move these dependencies to audit.h with the rest.

Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-11 10:54:05 +01:00
Chris Wright
804a6a49d8 Audit requires CONFIG_NET
Audit now actually requires netlink.  So make it depend on CONFIG_NET, 
and remove the inline dependencies on CONFIG_NET.

Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-11 10:52:45 +01:00
Chris Wright
5a241d7703 AUDIT: Properly account for alignment difference in nlmsg_len.
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-11 10:43:07 +01:00
David Woodhouse
eecb0a7338 AUDIT: Fix abuse of va_args.
We're not allowed to use args twice; we need to use va_copy.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-10 18:58:51 +01:00
David Woodhouse
e3b926b4c1 AUDIT: pass size argument to audit_expand().
Let audit_expand() know how much it's expected to grow the buffer, in 
the case that we have that information to hand.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-10 18:56:08 +01:00
Steve Grubb
8c5aa40c94 AUDIT: Fix reported length of audit messages.
We were setting nlmsg_len to skb->len, but we should be subtracting
the size of the header.

From: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-10 18:53:07 +01:00
David Woodhouse
4332bdd332 AUDIT: Honour gfp_mask in audit_buffer_alloc()
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-06 15:59:57 +01:00
Chris Wright
5ac52f33b6 AUDIT: buffer audit msgs directly to skb
Drop the use of a tmp buffer in the audit_buffer, and just buffer
directly to the skb.  All header data that was temporarily stored in
the audit_buffer can now be stored directly in the netlink header in
the skb.  Resize skb as needed.  This eliminates the extra copy (and
the audit_log_move function which was responsible for copying).

Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-06 15:54:53 +01:00
Chris Wright
8fc6115c2a AUDIT: expand audit tmp buffer as needed
Introduce audit_expand and make the audit_buffer use a dynamic buffer
which can be resized.  When audit buffer is moved to skb it will not
be fragmented across skb's, so we can eliminate the sklist in the
audit_buffer.  During audit_log_move, we simply copy the full buffer
into a single skb, and then audit_log_drain sends it on.

Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-06 15:54:17 +01:00
Chris Wright
16e1904e69 AUDIT: Add helper functions to allocate and free audit_buffers.
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-06 15:53:34 +01:00
Steve Grubb
c2f0c7c356 The attached patch addresses the problem with getting the audit daemon
shutdown credential information. It creates a new message type 
AUDIT_TERM_INFO, which is used by the audit daemon to query who issued the 
shutdown. 

It requires the placement of a hook function that gathers the information. The 
hook is after the DAC & MAC checks and before the function returns. Racing 
threads could overwrite the uid & pid - but they would have to be root and 
have policy that allows signalling the audit daemon. That should be a 
manageable risk.

The userspace component will be released later in audit 0.7.2. When it 
receives the TERM signal, it queries the kernel for shutdown information. 
When it receives it, it writes the message and exits. The message looks 
like this:

type=DAEMON msg=auditd(1114551182.000) auditd normal halt, sending pid=2650 
uid=525, auditd pid=1685

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-06 12:38:39 +01:00
Domen Puncer
ebe8b54134 [PATCH] correctly name the Shell sort
As per http://www.nist.gov/dads/HTML/shellsort.html, this should be
referred to as a Shell sort.  Shell-Metzner is a misnomer.

Signed-off-by: Daniel Dickman <didickman@yahoo.com>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-05 16:36:50 -07:00
Paulo Marques
b7e4e85337 [PATCH] setitimer timer expires too early
It seems that the code responsible for this is in kernel/itimer.c:126:

	p->signal->real_timer.expires = jiffies + interval;
	add_timer(&p->signal->real_timer);

If you request an interval of, lets say 900 usecs, the interval given by
timeval_to_jiffies will be 1.

If you request this when we are half-way between two timer ticks, the
interval will only give 400 usecs.

If we want to guarantee that we never ever give intervals less than
requested, the simple solution would be to change that to:

	p->signal->real_timer.expires = jiffies + interval + 1;

This however will produce pathological cases, like having a idle system
being requested 1 ms timeouts will give systematically 2 ms timeouts,
whereas currently it simply gives a few usecs less than 1 ms.

The complex (and more computationally expensive) solution would be to
check the gettimeofday time, and compute the correct number of jiffies.
This way, if we request a 300 usecs timer 200 usecs inside the timer
tick, we can wait just one tick, but not if we are 800 usecs inside the
tick. This would also mean that we would have to lock preemption during
these computations to avoid races, etc.

I've searched the archives but couldn't find this particular issue being
discussed before.

Attached is a patch to do the simple solution, in case anybody thinks
that it should be used.

Signed-Off-By: Paulo Marques <pmarques@grupopie.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-05 16:36:41 -07:00
Ananth N Mavinakayanahalli
64f562c6df [PATCH] kprobes: Allow multiple kprobes at the same address
Allow registration of multiple kprobes at an address in an architecture
agnostic way.  Corresponding handlers will be invoked in a sequence.  But,
a kprobe and a jprobe can't (yet) co-exist at the same address.

Signed-off-by: Ananth N Mavinakayanahalli <amavin@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-05 16:36:39 -07:00
Prasanna S Panchamukhi
04dea5f932 [PATCH] Kprobes: Oops! in unregister_kprobe()
kernel oops!  when unregister_kprobe() is called on a non-registered
kprobe.  This patch fixes the above problem by checking if the probe exists
before unregistering.

Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-05 16:36:39 -07:00
Anton Blanchard
7d12e522ba [PATCH] ppc64: remove hidden -fno-omit-frame-pointer for schedule.c
While looking at code generated by gcc4.0 I noticed some functions still
had frame pointers, even after we stopped ppc64 from defining
CONFIG_FRAME_POINTER.  It turns out kernel/Makefile hardwires
-fno-omit-frame-pointer on when compiling schedule.c.

Create CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER and define it on architectures
that dont require frame pointers in sched.c code.

(akpm: blame me for the name)

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-05 16:36:32 -07:00
David Woodhouse
075d6eb16d [PATCH] ppc32: platform-specific functions missing from kallsyms.
The PPC32 kernel puts platform-specific functions into separate sections so
that unneeded parts of it can be freed when we've booted and actually
worked out what we're running on today.

This makes kallsyms ignore those functions, because they're not between
_[se]text or _[se]inittext.  Rather than teaching kallsyms about the
various pmac/chrp/etc sections, this patch adds '_[se]extratext' markers
for kallsyms.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-05 16:36:31 -07:00
David Woodhouse
bfd4bda097 Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git 2005-05-05 13:59:37 +01:00
Linus Torvalds
897f5ab2cd Automatic merge of rsync://rsync.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6.git 2005-05-04 19:52:45 -07:00
Herbert Xu
2a0a6ebee1 [NETLINK]: Synchronous message processing.
Let's recap the problem.  The current asynchronous netlink kernel
message processing is vulnerable to these attacks:

1) Hit and run: Attacker sends one or more messages and then exits
before they're processed.  This may confuse/disable the next netlink
user that gets the netlink address of the attacker since it may
receive the responses to the attacker's messages.

Proposed solutions:

a) Synchronous processing.
b) Stream mode socket.
c) Restrict/prohibit binding.

2) Starvation: Because various netlink rcv functions were written
to not return until all messages have been processed on a socket,
it is possible for these functions to execute for an arbitrarily
long period of time.  If this is successfully exploited it could
also be used to hold rtnl forever.

Proposed solutions:

a) Synchronous processing.
b) Stream mode socket.

Firstly let's cross off solution c).  It only solves the first
problem and it has user-visible impacts.  In particular, it'll
break user space applications that expect to bind or communicate
with specific netlink addresses (pid's).

So we're left with a choice of synchronous processing versus
SOCK_STREAM for netlink.

For the moment I'm sticking with the synchronous approach as
suggested by Alexey since it's simpler and I'd rather spend
my time working on other things.

However, it does have a number of deficiencies compared to the
stream mode solution:

1) User-space to user-space netlink communication is still vulnerable.

2) Inefficient use of resources.  This is especially true for rtnetlink
since the lock is shared with other users such as networking drivers.
The latter could hold the rtnl while communicating with hardware which
causes the rtnetlink user to wait when it could be doing other things.

3) It is still possible to DoS all netlink users by flooding the kernel
netlink receive queue.  The attacker simply fills the receive socket
with a single netlink message that fills up the entire queue.  The
attacker then continues to call sendmsg with the same message in a loop.

Point 3) can be countered by retransmissions in user-space code, however
it is pretty messy.

In light of these problems (in particular, point 3), we should implement
stream mode netlink at some point.  In the mean time, here is a patch
that implements synchronous processing.  

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-05-03 14:55:09 -07:00
Russ Anderson
012914dad2 [patch] MCA recovery module undefined symbol fix
The patch "MCA recovery improvements" added do_exit to mca_drv.c.
That's fine when the mca recovery code is built in the kernel
(CONFIG_IA64_MCA_RECOVERY=y) but breaks building the mca recovery
code as a module (CONFIG_IA64_MCA_RECOVERY=m).

Most users are currently building this as a module, as loading
and unloading the module provides a very convenient way to turn
on/off error recovery.

This patch exports do_exit, so mca_drv.c can build as a module.

Signed-off-by: Russ Anderson (rja@sgi.com)
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-05-03 13:58:17 -07:00
Chris Wright
0dd8e06bda [PATCH] add new audit data to last skb
When adding more formatted audit data to an skb for delivery to userspace,
the kernel will attempt to reuse an skb that has spare room.  However, if
the audit message has already been fragmented to multiple skb's, the search
for spare room in the skb uses the head of the list.  This will corrupt the
audit message with trailing bytes being placed midway through the stream.
Fix is to look at the end of the list.

Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-05-03 14:01:15 +01:00
David Woodhouse
27b030d58c Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git 2005-05-03 08:14:09 +01:00
Adrian Bunk
408b664a7d [PATCH] make lots of things static
Another large rollup of various patches from Adrian which make things static
where they were needlessly exported.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:59:29 -07:00
Martin Waitz
67be2dd1ba [PATCH] DocBook: fix some descriptions
Some KernelDoc descriptions are updated to match the current code.
No code changes.

Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:59:26 -07:00
Pavel Pisa
4dc3b16ba1 [PATCH] DocBook: changes and extensions to the kernel documentation
I have recompiled Linux kernel 2.6.11.5 documentation for me and our
university students again.  The documentation could be extended for more
sources which are equipped by structured comments for recent 2.6 kernels.  I
have tried to proceed with that task.  I have done that more times from 2.6.0
time and it gets boring to do same changes again and again.  Linux kernel
compiles after changes for i386 and ARM targets.  I have added references to
some more files into kernel-api book, I have added some section names as well.
 So please, check that changes do not break something and that categories are
not too much skewed.

I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved
by kernel convention.  Most of the other changes are modifications in the
comments to make kernel-doc happy, accept some parameters description and do
not bail out on errors.  Changed <pid> to @pid in the description, moved some
#ifdef before comments to correct function to comments bindings, etc.

You can see result of the modified documentation build at
  http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz

Some more sources are ready to be included into kernel-doc generated
documentation.  Sources has been added into kernel-api for now.  Some more
section names added and probably some more chaos introduced as result of quick
cleanup work.

Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:59:25 -07:00
Jesper Juhl
7ed20e1ad5 [PATCH] convert that currently tests _NSIG directly to use valid_signal()
Convert most of the current code that uses _NSIG directly to instead use
valid_signal().  This avoids gcc -W warnings and off-by-one errors.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:59:14 -07:00
Stephen Rothwell
7d87e14c23 [PATCH] consolidate sys_shmat
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:59:12 -07:00
Paul E. McKenney
fbd568a3e6 [PATCH] Change synchronize_kernel to _rcu and _sched
This patch changes calls to synchronize_kernel(), deprecated in the earlier
"Deprecate synchronize_kernel, GPL replacement" patch to instead call the new
synchronize_rcu() and synchronize_sched() APIs.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:59:04 -07:00
Paul E. McKenney
9b06e81898 [PATCH] Deprecate synchronize_kernel, GPL replacement
The synchronize_kernel() primitive is used for quite a few different purposes:
waiting for RCU readers, waiting for NMIs, waiting for interrupts, and so on.
This makes RCU code harder to read, since synchronize_kernel() might or might
not have matching rcu_read_lock()s.  This patch creates a new
synchronize_rcu() that is to be used for RCU readers and a new
synchronize_sched() that is used for the rest.  These two new primitives
currently have the same implementation, but this is might well change with
additional real-time support.  Both new primitives are GPL-only, the old
primitive is deprecated.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:59:04 -07:00
Paul E. McKenney
66cf8f1443 [PATCH] kernel/rcupdate.c: make the exports EXPORT_SYMBOL_GPL
The gpl exports need to be put back.  Moving them to GPL -- but in a
measured manner, as I proposed on this list some months ago -- is fine.
Changing these particular exports precipitously is most definitely -not-
fine.  Here is my earlier proposal:

	http://marc.theaimsgroup.com/?l=linux-kernel&m=110520930301813&w=2

See below for a patch that puts the exports back, along with an updated
version of my earlier patch that starts the process of moving them to GPL.
I will also be following this message with RFC patches that introduce two
(EXPORT_SYMBOL_GPL) interfaces to replace synchronize_kernel(), which then
becomes deprecated.

Signed-off-by: <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:59:03 -07:00
Matt Mackall
d59745ce3e [PATCH] clean up kernel messages
Arrange for all kernel printks to be no-ops.  Only available if
CONFIG_EMBEDDED.

This patch saves about 375k on my laptop config and nearly 100k on minimal
configs.

Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:59:02 -07:00
Matt Mackall
e43379f10b [PATCH] nice and rt-prio rlimits
Add a pair of rlimits for allowing non-root tasks to raise nice and rt
priorities. Defaults to traditional behavior. Originally written by
Chris Wright.

The patch implements a simple rlimit ceiling for the RT (and nice) priorities
a task can set.  The rlimit defaults to 0, meaning no change in behavior by
default.  A value of 50 means RT priority levels 1-50 are allowed.  A value of
100 means all 99 privilege levels from 1 to 99 are allowed.  CAP_SYS_NICE is
blanket permission.

(akpm: see http://www.uwsg.iu.edu/hypermail/linux/kernel/0503.1/1921.html for
tips on integrating this with PAM).

Signed-off-by: Matt Mackall <mpm@selenic.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:59:00 -07:00
akpm@osdl.org
d59dd4620f [PATCH] use smp_mb/wmb/rmb where possible
Replace a number of memory barriers with smp_ variants.  This means we won't
take the unnecessary hit on UP machines.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:58:47 -07:00
Linus Torvalds
c06fec5022 Remove bogus BUG() in kernel/exit.c
It's old sanity checking that may have been useful for debugging, but
is just bogus these days.

Noticed by Mattia Belletti.
2005-04-29 09:37:07 -07:00
Steve Grubb
456be6cd90 [AUDIT] LOGIN message credentials
Attached is a new patch that solves the issue of getting valid credentials 
into the LOGIN message. The current code was assuming that the audit context 
had already been copied. This is not always the case for LOGIN messages.

To solve the problem, the patch passes the task struct to the function that 
emits the message where it can get valid credentials.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-04-29 17:30:07 +01:00
Chris Wright
37509e749d [AUDIT] Requeue messages at head of queue, up to audit_backlog
If netlink_unicast() fails, requeue the skb back at the head of the queue
it just came from, instead of the tail. And do so unless we've exceeded
the audit_backlog limit; not according to some other arbitrary limit.

From: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-04-29 17:19:14 +01:00
Serge Hallyn
c94c257c88 Add audit uid to netlink credentials
Most audit control messages are sent over netlink.In order to properly
log the identity of the sender of audit control messages, we would like
to add the loginuid to the netlink_creds structure, as per the attached
patch.

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-04-29 16:27:17 +01:00
85c8721ff3 audit: update pointer to userspace tools, remove emacs mode tags 2005-04-29 16:23:29 +01:00
Peter Martuccelli
c7fcb0ee74 [AUDIT] Avoid using %*.*s format strings.
They don't seem to work correctly (investigation ongoing), but we don't
actually need to do it anyway.

Patch from Peter Martuccelli <peterm@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-04-29 16:10:24 +01:00
Steve Grubb
d812ddbb89 [AUDIT] Fix signedness of 'serial' in various routines.
Attached is a patch that corrects a signed/unsigned warning. I also noticed
that we needlessly init serial to 0. That only needs to occur if the kernel
was compiled without the audit system.

-Steve Grubb

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-04-29 16:09:52 +01:00
2fd6f58ba6 [AUDIT] Don't allow ptrace to fool auditing, log arch of audited syscalls.
We were calling ptrace_notify() after auditing the syscall and arguments,
but the debugger could have _changed_ them before the syscall was actually
invoked. Reorder the calls to fix that.

While we're touching ever call to audit_syscall_entry(), we also make it
take an extra argument: the architecture of the syscall which was made,
because some architectures allow more than one type of syscall.

Also add an explicit success/failure flag to audit_syscall_exit(), for
the benefit of architectures which return that in a condition register
rather than only returning a single register.

Change type of syscall return value to 'long' not 'int'.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-04-29 16:08:28 +01:00
Andrew Morton
81b7854d52 audit_log_untrustedstring() warning fix
kernel/audit.c: In function `audit_log_untrustedstring':
kernel/audit.c:736: warning: comparison is always false due to limited range of data type

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-04-29 15:59:11 +01:00
83c7d09173 AUDIT: Avoid log pollution by untrusted strings.
We log strings from userspace, such as arguments to open(). These could
be formatted to contain \n followed by fake audit log entries. Provide
a function for logging such strings, which gives a hex dump when the
string contains anything but basic printable ASCII characters. Use it
for logging filenames.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2005-04-29 15:54:44 +01:00
Christoph Lameter
9acf6597c5 [PATCH] time interpolator: Fix settimeofday inaccuracy
settimeofday will set the time a little bit too early on systems using
time interpolation since it subtracts the current interpolator offset
from the time. This used to be necessary with the code in 2.6.9 and earlier
but the new code resets the time interpolator after setting the time.
Thus the time is set too early and gettimeofday will return a time slightly
before the time specified with settimeofday if invoked immeditely after
settimeofday.

This removes the obsolete subtraction of the time interpolator offset
and makes settimeofday set the time accurately. 

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-28 08:13:58 -07:00
Tom 'spot' Callaway
a271c241a6 [SPARC]: Stop-A printk cleanup
This patch is incredibly trivial, but it does resolve some of the user
confusion as to what "L1-A" actually is.

Clarify printk message to refer to Stop-A (L1-A).

Gentoo has a virtually identical patch in their kernel sources.

Signed-off-by: Tom 'spot' Callaway <tcallawa@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-04-24 20:38:02 -07:00
Ingo Molnar
238628edb6 [PATCH] sched: fix signed comparisons of long long
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-18 10:58:36 -07:00
Stephen Smalley
219f081703 [PATCH] SELinux: fix deadlock on dcache lock
This fixes a deadlock on the dcache lock detected during testing at IBM
by moving the logging of the current executable information from the
SELinux avc_audit function to audit_log_exit (via an audit_log_task_info
helper) for processing upon syscall exit. 

For consistency, the patch also removes the logging of other
task-related information from avc_audit, deferring handling to
audit_log_exit instead. 

This allows simplification of the avc_audit code, allows the exe
information to be obtained more reliably, always includes the comm
information (useful for scripts), and avoids including bogus task
information for checks performed from irq or softirq. 

Signed-off-by:  Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by:  James Morris <jmorris@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-18 10:47:35 -07:00
Coywolf Qi Hunt
6c46ada700 [PATCH] reparent_to_init cleanup
This patch hides reparent_to_init().  reparent_to_init() should only be
called by daemonize().

Signed-off-by: Coywolf Qi Hunt <coywolf@lovecn.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16 15:26:01 -07:00
Benoit Boissinot
9a8488965d [PATCH] cpuset: remove function attribute const
gcc-4 warns with
include/linux/cpuset.h:21: warning: type qualifiers ignored on function
return type

cpuset_cpus_allowed is declared with const
extern const cpumask_t cpuset_cpus_allowed(const struct task_struct *p);

First const should be __attribute__((const)), but the gcc manual
explains that:

"Note that a function that has pointer arguments and examines the data
pointed to must not be declared const. Likewise, a function that calls a
non-const function usually must not be const. It does not make sense for
a const function to return void."

The following patch remove const from the function declaration.

Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16 15:25:59 -07:00
Lennert Buytenhek
b52402c783 [PATCH] pci enumeration on ixp2000: overflow in kernel/resource.c
IXP2000 (ARM-based) platforms use a separate 'struct resource' for PCI MEM
space.  Resource allocation for PCI BARs always fails because the 'root'
resource (the IXP2000 PCI MEM resource) always has the entire address space
(00000000-ffffffff) free, and find_resource() calculates the size of that
range as ffffffff-00000000+1=0, so all allocations fail because it thinks
there is no space.

(akpm: pls. double-check)

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16 15:25:58 -07:00
Christoph Hellwig
6cae60feb6 [PATCH] kill #ifndef HAVE_ARCH_GET_SIGNAL_TO_DELIVER in signal.c
Now that no architectures defines HAVE_ARCH_GET_SIGNAL_TO_DELIVER anymore
this can go away.  It was a transitional hack only.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16 15:25:47 -07:00
Bert Wesarg
31143a1204 [PATCH] kernel/param.c: don't use .max when .num is NULL in param_array_set()
there seems to be a bug, at least for me, in kernel/param.c for arrays with
.num == NULL.  If .num == NULL, the function param_array_set() uses &.max
for the call to param_array(), wich alters the .max value to the number of
arguments.  The result is, you can't set more array arguments as the last
time you set the parameter.

example:

# a module 'example' with
# static int array[10] = { 0, };
# module_param_array(array, int, NULL, 0644);

$ insmod example.ko array=1,2,3
$ cat /sys/module/example/parameters/array
1,2,3
$ echo "4,3,2,1" > /sys/module/example/parameters/array
$ dmesg | tail -n 1
kernel: array: can take only 3 arguments

Signed-off-by: Bert Wesarg <wesarg@informatik.uni-halle.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16 15:25:42 -07:00
Alexander Nyberg
43117a0828 [PATCH] swsusp: SMP fix
Fix some smp_processor_id-in-preemptible warnings

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16 15:25:39 -07:00
David S. Miller
51410d3c53 [PATCH] Fix get_compat_sigevent()
I have no idea how a bug like this lasted so long.  Anyways, obvious
memset()'ing of incorrect pointer.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16 15:24:01 -07:00
James Bottomley
81ddef77bb [PATCH] re-export cancel_rearming_delayed_workqueue
This was unexported by Arjan because we have no current users.

However, during a conversion from tasklets to workqueues of the parisc led
functions, we ran across a case where this was needed.  In particular, the
open coded equivalent of cancel_rearming_delayed_workqueue was implemented
incorrectly, which is, I think, all the evidence necessary that this is a
useful API.

Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16 15:23:59 -07:00
Linus Torvalds
1da177e4c3 Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
2005-04-16 15:20:36 -07:00